# BMIS-2542: Data Programming Essentials with Python 
##### Katz Graduate School of Business, Fall 2019


## Session-5: Data Wrangling Exercises
***

In this notebook, we apply some of the data wrangling techniques we have learned to analyze a few datasets with Python.<br>
We will spend some time working on the exercises first, and then discuss sample solutions.  

In [None]:
# load required modules
import pandas as pd
import numpy as np

## <mark>MovieLens 1M Dataset</mark>

[GroupLens Research](https://grouplens.org) provides movie rating data collected in the late 1990s and early 2000s from the [MovieLens](https://movielens.org/) website, which helps people find movies to watch. The data includes movie ratings, movie metadata (i.e., genre and year), and demographic data about the users (age, zip code, gender, and occupation). It has often been used to develop recommendation systems. 

**MovieLens 1M Dataset** contains about 1 million ratings collected from roughly 6,000 users on roughly 4,000 movies. It comes with three tables: `ratings`, `user information`, and `movie information`.

**Dataset**: 
 - Download `ml-1m.zip` either from CourseWeb or from [here](https://grouplens.org/datasets/movielens/)
 - Extract the zip file and save the 3 data files in a directory inside your Jupyter working directory (e.g., `Data/ML1M/`)

A movie is considered **active** if it has received at least 250 ratings.<br> We aim to analyze the following <u>only for the active movies</u>.<br>

### Data Wrangling Objectives:
1. Find the movies that elicited the most rating disagreement among viewers
2. Compute the mean rating for each movie, grouped by gender
3. Obtain the top 10 movies among the female viewers
4. Find the movies that are most divisive (i.e., rating disagreement) between male and female viewers

As you can see, the data comes as `.DAT` files, which contain raw data.<br>
A DAT file may hold text, pictures, videos, or configuration data for a software application, and the extension alone does not tell us which. How to open a DAT file therefore depends on what it contains.<br>
Nevertheless, DAT files are usually in plain text format, so we can open them in a text editor.<br>

- Open each DAT file in a text editor such as Notepad and inspect its content.
- Next, open the `README` file in a text editor. This file describes what is stored in the data files. <br>**Inspect its content and note down anything that may be important for the data wrangling process.**

Next, let's load each DAT file into a `DataFrame` so that it can be examined with Python.<br>
We can use the pandas `read_csv` function to do so.

**Users**: As given in the README file, `users.dat` comes in the format `UserID::Gender::Age::Occupation::Zip-code`. <br>It does not contain column names (i.e., headers). <br>Therefore, let's load `users.dat` into a `DataFrame` with proper headings first.

In [34]:
import pandas as pd
import numpy as np
# the fourth field in users.dat is the occupation code (see the README)
unames = ['UserId', 'gender', 'age', 'occupation', 'zip']
users = pd.read_csv('C:/Users/Sunny/Documents/py_jupyter/ml-1m/ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
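
The loading step can also be tried without the data files. The sketch below is only an illustration: it parses two made-up rows in the `UserID::Gender::Age::Occupation::Zip-code` layout from an in-memory buffer, and the name `users_demo` is hypothetical. It assumes that zip codes should be kept as strings so that leading zeros survive.

```python
import io
import pandas as pd

# Two made-up rows in the users.dat layout (not taken from the real dataset)
sample = io.StringIO("1::F::25::4::15213\n2::M::35::7::05401\n")

# Column names copied from the layout given in the README
unames = ["UserID", "Gender", "Age", "Occupation", "Zip-code"]

# A multi-character separator such as '::' needs the Python parsing engine;
# reading the zip code as a string keeps any leading zero
users_demo = pd.read_csv(
    sample,
    sep="::",
    header=None,
    names=unames,
    engine="python",
    dtype={"Zip-code": str},
)
print(users_demo)
```

Passing `header=None` together with `names=` tells pandas that the first line is data rather than a header row.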





Similarly, load the rest of the DAT files.

In [30]:
# movies
mnames = ['MovieID','Title','Genres']
movers = pd.read_csv('C:/Users/Sunny/Documents/py_jupyter/ml-1m/ml-1m/movies.dat', sep = '::' , header = None , names = mnames, engine = 'python' )
movers.head(5)


| | MovieID | Title | Genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Animation\|Children's\|Comedy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children's\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [32]:
# ratings
rating = ['UserID','MovieID','Rating','Timestamp']
raters = pd.read_csv('C:/Users/Sunny/Documents/py_jupyter/ml-1m/ml-1m/ratings.dat', sep = '::' , header = None , names = rating, engine = 'python' )

Analyzing the data is much easier if all of it is merged into a single table.<br>
To do this, we can first merge the `ratings` with the `users` and then merge the result with the `movies` data.<br>
When no key is given, pandas infers which columns to use as the merge (i.e., join) keys from the column names that the two tables share.

In [36]:
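# `users` was loaded with its key named 'UserId', while `raters` uses 'UserID';
# align the names so that the two frames share a column to merge on
users = users.rename(columns={'UserId': 'UserID'})

# merge ratings with users (on UserID), then merge the result with movies (on MovieID)
dfMerged = pd.merge(pd.merge(raters, users), movers)
dfMerged.head()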
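
To see how the key inference behaves, here is a minimal sketch on three tiny made-up tables. The frame names (`ratings_demo`, `users_demo`, `movies_demo`) and their contents are hypothetical and only mimic the shape of the MovieLens tables.

```python
import pandas as pd

# Tiny made-up tables that mimic the three MovieLens frames
ratings_demo = pd.DataFrame(
    {"UserID": [1, 1, 2], "MovieID": [10, 20, 10], "Rating": [5, 3, 4]}
)
users_demo = pd.DataFrame({"UserID": [1, 2], "gender": ["F", "M"]})
movies_demo = pd.DataFrame(
    {"MovieID": [10, 20], "Title": ["Movie A (1999)", "Movie B (2000)"]}
)

# With no `on=` argument, merge joins on the columns the two frames share:
# 'UserID' for the inner merge and 'MovieID' for the outer one
merged_demo = pd.merge(pd.merge(ratings_demo, users_demo), movies_demo)
print(merged_demo)
```

If the two frames share no column name at all, `pd.merge` raises a `MergeError`, which is why the key columns must be spelled identically in both tables (or passed explicitly with `left_on` and `right_on`).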

#### Recognizing the Active Movies
As the analysis is to be performed only for the active movies, we need to identify them first.<br>
We can do this by first computing the rating count for each movie, and then dropping the movies that received fewer than 250 ratings.

In [24]:
# obtain the rating count for each title. 


In [None]:
# Filter the Series to keep only the titles with >= 250 ratings
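
One possible approach, sketched on a small made-up table rather than the real data: count the rows per title with `groupby(...).size()`, then keep the index entries that meet the threshold. The names `data_demo` and `min_ratings` are hypothetical, and the threshold is lowered from 250 to 2 so that the toy data produces a result.

```python
import pandas as pd

# Made-up ratings: title A has 3 ratings, B has 2, C has 1
data_demo = pd.DataFrame(
    {
        "Title": ["A", "A", "A", "B", "B", "C"],
        "Rating": [5, 4, 3, 2, 5, 1],
    }
)

min_ratings = 2  # the exercise uses 250; lowered here for the toy data

# Number of ratings per title (a Series indexed by Title)
ratings_by_title = data_demo.groupby("Title").size()

# Index of the titles that meet the threshold
active_titles = ratings_by_title.index[ratings_by_title >= min_ratings]
print(list(active_titles))
```

Keeping `active_titles` as an index makes it easy to reuse later with `.loc[...]` on any Series or DataFrame indexed by title.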


#### Movies with the Most Rating Disagreement Among Viewers
Rating disagreement among viewers may be measured by the variance or standard deviation of the ratings.

In [None]:
# Standard deviation of rating grouped by title


In [None]:
# Filter down to active_titles


In [None]:
# Order Series by value in descending order
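
The three steps above might look as follows on made-up data. This is a sketch under assumptions, not the reference solution: `data_demo` and `min_ratings` are hypothetical, and the threshold is 3 instead of 250.

```python
import pandas as pd

# Made-up ratings: A is rated very unevenly, B consistently, C only once
data_demo = pd.DataFrame(
    {
        "Title": ["A", "A", "A", "B", "B", "B", "C"],
        "Rating": [5, 1, 3, 4, 4, 5, 2],
    }
)
min_ratings = 3  # the exercise uses 250

# Titles with enough ratings to count as active
counts = data_demo.groupby("Title").size()
active_titles = counts.index[counts >= min_ratings]

# Standard deviation of the rating for each title
rating_std = data_demo.groupby("Title")["Rating"].std()

# Keep the active titles only, then sort from most to least disagreement
most_disputed = rating_std.loc[active_titles].sort_values(ascending=False)
print(most_disputed.head(10))
```

Filtering before sorting also removes titles with a single rating, whose sample standard deviation is undefined (`NaN`).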


#### Computing the Mean Ratings

In [None]:
# Step 1: Create a pivot table to compute the mean ratings for each movie by gender


In [None]:
# Step 2: Select the active movies
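
A minimal sketch of both steps on made-up data, assuming the merged table has `Title`, `gender`, and `Rating` columns. The names `data_demo` and `min_ratings` are hypothetical, and the threshold is 2 instead of 250.

```python
import pandas as pd

# Made-up merged data: one row per rating
data_demo = pd.DataFrame(
    {
        "Title": ["A", "A", "A", "B", "B", "C"],
        "gender": ["F", "M", "M", "F", "M", "F"],
        "Rating": [5, 3, 4, 2, 4, 1],
    }
)
min_ratings = 2  # the exercise uses 250

# Step 1: one row per title, one column per gender, cells hold the mean rating
mean_ratings = data_demo.pivot_table(
    values="Rating", index="Title", columns="gender", aggfunc="mean"
)

# Step 2: restrict the rows to the active titles
counts = data_demo.groupby("Title").size()
active_titles = counts.index[counts >= min_ratings]
mean_ratings = mean_ratings.loc[active_titles]
print(mean_ratings)
```

The resulting frame has the columns `F` and `M`, which the later steps can sort and compare.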


#### Top Movies Among Female Viewers
What can we do to obtain this?
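
One option is to sort the table of mean ratings by the `F` column in descending order and take the first 10 rows. The sketch below assumes a `mean_ratings` frame with `F` and `M` columns indexed by title; its values are made up for illustration.

```python
import pandas as pd

# Made-up mean ratings per title and gender
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 4.8], "M": [4.0, 3.5, 4.1]},
    index=pd.Index(["A", "B", "C"], name="Title"),
)

# Highest female mean rating first; head(10) keeps at most 10 rows
top_female = mean_ratings.sort_values(by="F", ascending=False).head(10)
print(top_female)
```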

#### Movies that are Most Divisive between Male and Female Viewers

One way is to examine the difference in mean ratings for each gender category.<br>
We can add the new column `diff` to the `mean_ratings` DataFrame to hold the difference in mean ratings for each movie.

In [None]:
# add the new column 'diff' to mean_ratings dataframe

Sorting by `diff` yields the movies with the greatest rating difference so that we can see which ones were preferred by women, but not rated by men as highly.

In [None]:
# sort by 'diff' and slice the top 10


By reversing the order of the rows sorted by `diff` above and then slicing the top 10 rows, we get the movies that men preferred but women did not rate as highly.

In [None]:
# reverse the sorted records and slice the top 10
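
The last three cells can be sketched together on made-up values. The sign convention is an assumption, since the text does not fix it: here `diff` is the male mean minus the female mean, so an ascending sort puts the movies preferred by women first, and reversing that order puts the movies preferred by men first.

```python
import pandas as pd

# Made-up mean ratings per title and gender
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 4.8], "M": [4.0, 3.5, 4.1]},
    index=pd.Index(["A", "B", "C"], name="Title"),
)

# Assumed convention: diff = male mean - female mean
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

# Ascending sort: most negative diff first, i.e. preferred by women
sorted_by_diff = mean_ratings.sort_values(by="diff")
preferred_by_women = sorted_by_diff[:10]

# Reverse the row order and slice: largest diff first, i.e. preferred by men
preferred_by_men = sorted_by_diff[::-1][:10]

print(preferred_by_women)
print(preferred_by_men)
```

To rank movies by how divisive they are regardless of direction, sorting by the absolute value of `diff` would be an alternative.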