In [1]:
import pymongo
import pandas
from pprint import pprint

# Connect
client = pymongo.MongoClient(host="mongo", port=27017, username="imdb", password="imdb_admin")

In [2]:
df1 = pandas.read_csv("IMDB-movies.csv")
df2 = pandas.read_csv("IMDB-directors.csv")
df3 = pandas.read_csv("IMDB-movies_directors.csv")
df4 = pandas.read_csv("IMDB-movies_genres.csv")
df5 = df4.merge(df1).merge(df3).merge(df2)

df4 = df4.merge(df1)
df3 = df3.merge(df1).merge(df2)

db = client['imdb_database']
collection1 = db['imdb_movies']
collection2 = db['imdb_directors']
collection3 = db['imdb_movies_directors']
collection4 = db['imdb_movies_genres']
collection5 = db['imdb_movies_directors_genres']


# reset_index() adds an "index" column; to_dict("records") gives one dict per row
for df, collection in [
    (df1, collection1),
    (df2, collection2),
    (df3, collection3),
    (df4, collection4),
    (df5, collection5),
]:
    df.reset_index(inplace=True)
    collection.insert_many(df.to_dict("records"))  # insert into the collection
print("done!")

done!


# Introduction to MongoDB (MongoDB Query Language)

##### Version 0.1

***

By Scott Coughlin (Northwestern IT Research Computing Services)  
20 July 2022

In our introduction to MongoDB we will start with queries of existing tables.

## Problem 1) IMDb Data

Throughout the session we will use information from the [Internet Movie Database (IMDb)](https://www.imdb.com/) to illustrate various principles regarding databases.

A quick note on the provenance of this data: the files used to populate this data set are from [this website](https://relational.fit.cvut.cz/dataset/IMDb), and they may not cover every movie on IMDb (for example, there are no movies after 2004).

For this exercise there are 5 collections:
```
collection1 = db['imdb_movies']
collection2 = db['imdb_directors']
collection3 = db['imdb_movies_directors']
collection4 = db['imdb_movies_genres']
collection5 = db['imdb_movies_directors_genres']
```
To make things simple, I have already performed the necessary steps to "join" the information from imdb_movies and imdb_directors into a bigger collection, "imdb_movies_directors", and so on for the other combined collections.
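As a rough illustration of what that "join" step does, the sketch below merges three tiny DataFrames the same way the setup cell merges the CSV files. The rows and the short column lists are invented for illustration; only the `merge` and `to_dict("records")` calls mirror the setup code.

```python
import pandas

# Toy stand-ins for the CSV files; the rows are invented for illustration.
movies = pandas.DataFrame({"movie_id": [1, 2], "name": ["Movie A", "Movie B"]})
directors = pandas.DataFrame({"director_id": [10, 11], "last_name": ["Smith", "Jones"]})
movies_directors = pandas.DataFrame({"director_id": [10, 11, 10], "movie_id": [1, 1, 2]})

# merge() defaults to an inner join on the columns the two frames share:
# movie_id for the first merge, director_id for the second.
joined = movies_directors.merge(movies).merge(directors)

# One dict per row, which is the shape insert_many() expects.
records = joined.to_dict("records")
print(joined)
```

Each resulting record carries the movie and director fields side by side, so a single `find()` on the combined collection can filter on either.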

#### HELPFUL TIP: Convert any query result to a pandas.DataFrame by wrapping the `pymongo` query as follows:

```
df = pandas.DataFrame(list(db.imdb_directors.find({"last_name": "Scorsese"})))
print(df)
	_id 	index 	director_id 	first_name 	last_name
0 	62da05f8e5d6d03453887957 	70115 	71645 	Martin 	Scorsese
```

### Second Helpful Tip: See the MongoDB SQL to Mongo mapping information to help: https://www.mongodb.com/docs/manual/reference/sql-comparison/
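To make that mapping a little more concrete, here is a minimal sketch pairing a few SQL fragments with the dictionaries pymongo expects. The helper name `select_directors` is hypothetical, and `collection` is assumed to be a pymongo collection such as `db['imdb_directors']`.

```python
# SQL: SELECT * FROM imdb_directors WHERE last_name = 'Scorsese'
where_filter = {"last_name": "Scorsese"}

# SQL: SELECT first_name, last_name FROM imdb_directors
# (1 includes a field, 0 excludes it; _id is returned unless excluded)
projection = {"first_name": 1, "last_name": 1, "_id": 0}

# SQL: ... WHERE year > 2000 AND year <= 2004
range_filter = {"year": {"$gt": 2000, "$lte": 2004}}


def select_directors(collection, limit=10):
    """Roughly: SELECT first_name, last_name ... ORDER BY first_name ASC LIMIT n.

    `collection` is assumed to be a pymongo Collection; the helper name is
    illustrative only.
    """
    cursor = collection.find({}, projection).sort("first_name", 1).limit(limit)
    return list(cursor)
```

With a live connection, `select_directors(db['imdb_directors'])` would return a list of dictionaries that can be passed straight to `pandas.DataFrame`.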

In [5]:
pandas.DataFrame(list(db.imdb_directors.find({"last_name" : "Scorsese"})))

Unnamed: 0,_id,index,director_id,first_name,last_name
0,665f83e93eea3e6d1b53095d,70115,71645,Martin,Scorsese


**Problem 1a**

Using pymongo, SELECT 10 movies from the `imdb_movies` collection. Then select 10 directors from `imdb_directors`, ordered by `first_name`.

In [7]:
[i for i in collection1.find().limit(10)]

[{'_id': ObjectId('665f83e63eea3e6d1b4c8c30'),
  'index': 0,
  'movie_id': 0,
  'name': '#28',
  'year': 2002,
  'rank': 0.0},
 {'_id': ObjectId('665f83e63eea3e6d1b4c8c31'),
  'index': 1,
  'movie_id': 1,
  'name': '#7 Train: An Immigrant Journey, The',
  'year': 2000,
  'rank': 0.0},
 {'_id': ObjectId('665f83e63eea3e6d1b4c8c32'),
  'index': 2,
  'movie_id': 2,
  'name': '$',
  'year': 1971,
  'rank': 6.4},
 {'_id': ObjectId('665f83e63eea3e6d1b4c8c33'),
  'index': 3,
  'movie_id': 3,
  'name': '$1,000 Reward',
  'year': 1913,
  'rank': 0.0},
 {'_id': ObjectId('665f83e63eea3e6d1b4c8c34'),
  'index': 4,
  'movie_id': 4,
  'name': '$1,000 Reward',
  'year': 1915,
  'rank': 0.0},
 {'_id': ObjectId('665f83e63eea3e6d1b4c8c35'),
  'index': 5,
  'movie_id': 5,
  'name': '$1,000 Reward',
  'year': 1923,
  'rank': 0.0},
 {'_id': ObjectId('665f83e63eea3e6d1b4c8c36'),
  'index': 6,
  'movie_id': 6,
  'name': '$1,000,000 Duck',
  'year': 1971,
  'rank': 5.0},
 {'_id': ObjectId('665f83e63eea3e6d1b4c

In [14]:
[i for i in collection2.find({}, {'first_name': 1, '_id': 0}).sort('first_name', 1).limit(10)]

[{'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'},
 {'first_name': 'A.'}]

**Problem 1b**

Using pymongo, how many movies are there? How many directors are there? 

In [15]:
collection1.count_documents({})

355146

In [16]:
collection2.count_documents({})

86880

*Write your answer here*

**Problem 1c**

Using pymongo, determine how many movies were made after the year 2000.

In [18]:
collection1.count_documents({'year': {"$gt": 2000}})

39586

*Write your answer here*

**Problem 1d**

How many different movie genres are there?

In [19]:
collection4.distinct('genre')

['Action',
 'Adult',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'Music',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Short',
 'Thriller',
 'War',
 'Western']
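Since the question asks for a count, one possible follow-up is to take the length of the list that `distinct` returns. The sketch below uses the genre names copied from the output above so that it runs without a database; the `count_pipeline` variable is an assumed alternative that lets the server do the counting.

```python
# Genre names copied from the distinct('genre') output above.
genres = [
    "Action", "Adult", "Adventure", "Animation", "Comedy", "Crime",
    "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", "Horror",
    "Music", "Musical", "Mystery", "Romance", "Sci-Fi", "Short",
    "Thriller", "War", "Western",
]

# distinct() returns a plain Python list, so len() gives the count.
n_genres = len(genres)
print(n_genres)

# Alternative sketch: group by genre on the server, then count the groups.
count_pipeline = [
    {"$group": {"_id": "$genre"}},
    {"$count": "n_genres"},
]
```

With a live connection the same number should come from `len(collection4.distinct('genre'))` or from `list(collection4.aggregate(count_pipeline))`.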

*Write your answer here*

## Problem 2) Groups and Aggregates

Now that we know why the data has been organized in this way, we can leverage this unique structure in order to learn interesting properties of the data. 

**Problem 2a**

In which year were the most movies made according to IMDb?

In [24]:
column_name = 'year'  # Replace with your column name

# Group by the column, count documents per value, and keep the ten most common
pipeline = [
    {"$group": {"_id": f"${column_name}", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 10}
]

list(collection1.aggregate(pipeline))

[{'_id': 2002, 'count': 10337},
 {'_id': 2003, 'count': 10119},
 {'_id': 2000, 'count': 10107},
 {'_id': 2001, 'count': 10002},
 {'_id': 1999, 'count': 9389},
 {'_id': 1998, 'count': 8636},
 {'_id': 1997, 'count': 7748},
 {'_id': 2004, 'count': 7558},
 {'_id': 1996, 'count': 7275},
 {'_id': 1995, 'count': 6923}]

*write your answer here*

**Problem 2b**

How many "Action" movies were made after the year 1980? Before the year 1980?

*write your answer here*
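One possible approach to the "Action" question is to combine an equality match on `genre` with a comparison operator on `year`. This sketch assumes the combined `imdb_movies_genres` collection (`collection4` in the setup cell) holds one document per movie and genre pair with `genre` and `year` fields; the helper name is hypothetical.

```python
# Both conditions in one filter dict are combined with an implicit AND.
after_1980 = {"genre": "Action", "year": {"$gt": 1980}}
before_1980 = {"genre": "Action", "year": {"$lt": 1980}}


def count_action_movies(collection):
    """Return (count after 1980, count before 1980).

    `collection` is assumed to be a pymongo Collection with `genre` and
    `year` fields, e.g. db['imdb_movies_genres'].
    """
    return (
        collection.count_documents(after_1980),
        collection.count_documents(before_1980),
    )
```

Movies from 1980 itself fall in neither group; swap in `$gte` or `$lte` if that year should be included.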

**Problem 2c**

Select all films made by `Scorsese`. How many are there?

*write your answer here*
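A minimal sketch for the Scorsese question, assuming the combined `imdb_movies_directors` collection (`collection3` in the setup cell) carries both the movie fields (`name`, `year`) and the director's `last_name`. The helper name is hypothetical.

```python
scorsese_filter = {"last_name": "Scorsese"}

# Keep only the movie title and year in the result; drop the Mongo _id.
film_fields = {"name": 1, "year": 1, "_id": 0}


def scorsese_films(collection):
    """Return (list of films sorted by year, number of films).

    `collection` is assumed to be a pymongo Collection such as
    db['imdb_movies_directors'].
    """
    films = list(collection.find(scorsese_filter, film_fields).sort("year", 1))
    return films, len(films)
```

If only the count is needed, `collection.count_documents(scorsese_filter)` avoids transferring the documents.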

**Problem 2d**

According to the IMDb data, which director has directed the most movies?

*write your answer here*
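The most prolific director can be found with the same group, sort, and limit pattern used for the busiest year. The pipeline below is a sketch that assumes the combined `imdb_movies_directors` collection has one document per movie and director pair; grouping on `director_id` as well as the name avoids merging two directors who share a name.

```python
most_prolific_pipeline = [
    # One group per director; count the documents (movies) in each group.
    {"$group": {
        "_id": {
            "director_id": "$director_id",
            "first_name": "$first_name",
            "last_name": "$last_name",
        },
        "n_movies": {"$sum": 1},
    }},
    # Largest count first, then keep only the top entry.
    {"$sort": {"n_movies": -1}},
    {"$limit": 1},
]
```

With a live connection this would be run as `list(collection3.aggregate(most_prolific_pipeline))`.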

**Problem 2e**

According to the IMDb data, which director has directed the most movies in each genre?

*write your answer here*
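For the per-genre version, one possible pipeline groups twice: first by genre and director to count movies, then by genre alone, keeping the first (largest) entry after sorting. This is a sketch that assumes the `imdb_movies_directors_genres` collection (`collection5` in the setup cell) has `genre`, `director_id`, `first_name`, and `last_name` fields; ties within a genre are resolved arbitrarily.

```python
top_director_per_genre_pipeline = [
    # Count movies for every (genre, director) pair.
    {"$group": {
        "_id": {
            "genre": "$genre",
            "director_id": "$director_id",
            "first_name": "$first_name",
            "last_name": "$last_name",
        },
        "n_movies": {"$sum": 1},
    }},
    # Sort so the most prolific pair comes first.
    {"$sort": {"n_movies": -1}},
    # Regroup by genre; $first picks the top pair because of the sort above.
    {"$group": {
        "_id": "$_id.genre",
        "first_name": {"$first": "$_id.first_name"},
        "last_name": {"$first": "$_id.last_name"},
        "n_movies": {"$first": "$n_movies"},
    }},
    # Alphabetical by genre for readability.
    {"$sort": {"_id": 1}},
]
```

Because the first stage builds many groups, passing `allowDiskUse=True` to `aggregate` may be needed on a memory-constrained server.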

## Challenge Problem) Make your own tables

**Problem 1a**

Create a new "collection".

**Problem 1b**

INSERT 3 "documents" into the "collection" you made above.

**Problem 1c**

Create a pandas DataFrame and save the rows as "documents" in a new "collection" you make

*Hint: look at using `pandas.DataFrame.to_dict`.*
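A minimal sketch of how the challenge steps might fit together, following the same pattern as the setup cell. The helper names, the collection name passed in, and the three example documents are all invented for illustration; `db` is assumed to be a pymongo database such as `client['imdb_database']`.

```python
# Three hand-written documents; field names and values are invented examples.
new_documents = [
    {"title": "Example Movie A", "year": 2001},
    {"title": "Example Movie B", "year": 2002},
    {"title": "Example Movie C", "year": 2003},
]


def create_and_fill(db, name, documents):
    """Insert `documents` into db[name] and return how many were inserted.

    MongoDB creates the collection on the first insert, so indexing the
    database by name is enough to "create" it.
    """
    collection = db[name]
    result = collection.insert_many(documents)
    return len(result.inserted_ids)


def save_dataframe(db, name, df):
    """Save each row of a pandas DataFrame as one document in db[name]."""
    return create_and_fill(db, name, df.to_dict("records"))
```

Note that `insert_many` adds an `_id` field to each dictionary it is given, so reusing the same list for a second insert would raise a duplicate key error.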