# Introduction to interactive data mining with Python and Pandas

Author: Valentin Barriere

Objective: Understand the content and structure of the data, and spot its problems (missing or aberrant values)

Inspired by the work of Alexandre Gramfort

### Data:

The MovieLens 1M data set contains ratings given to films by users of the MovieLens site. The data is provided; if needed, it can also be downloaded from: http://grouplens.org/datasets/movielens/

### Loading useful packages

In [1]:
import pandas as pd  #for data exploration
import numpy as np   #for numerical operations (Matlab type)

### Reading user data in a Pandas DataFrame

In [2]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')

In [3]:
users.head() # head() displays only the first lines for viewing

Unnamed: 0,user_id,gender,age,occupation,zip
0,1,F,1,10,48067
1,2,M,56,16,70072
2,3,M,25,15,55117
3,4,M,45,7,2460
4,5,M,25,20,55455
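The `zip` value `2460` in the preview above looks like a postal code that has lost its leading zero. Whether that happened at load time or when the table was rendered is not clear from the notebook, so the following is only a minimal sketch on a made-up three-line sample in the same `::`-separated layout. It shows how type inference can turn a code such as `02460` into the integer `2460`, and how passing `dtype` keeps it as text.

```python
import io
import pandas as pd

# Toy sample in the same '::'-separated layout as users.dat (illustrative rows).
raw = "1::F::1::10::48067\n4::M::45::7::02460\n5::M::25::20::55455\n"
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']

# Default parsing infers zip as an integer, so '02460' becomes 2460.
as_int = pd.read_csv(io.StringIO(raw), sep='::', header=None,
                     names=unames, engine='python')

# Forcing zip to str keeps the code exactly as written in the file.
as_str = pd.read_csv(io.StringIO(raw), sep='::', header=None,
                     names=unames, engine='python', dtype={'zip': str})

print(as_int['zip'].tolist())
print(as_str['zip'].tolist())
```

Identifiers such as postal codes are usually safer stored as strings, since arithmetic on them is meaningless.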


### Reading rating data in a Pandas DataFrame

In [4]:
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')

In [5]:
ratings.head(10)

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291
5,1,1197,3,978302268
6,1,1287,5,978302039
7,1,2804,5,978300719
8,1,594,4,978302268
9,1,919,4,978301368


### Reading film data in a Pandas DataFrame

In [6]:
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')

In [7]:
movies.head(10)

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy
5,6,Heat (1995),Action|Crime|Thriller
6,7,Sabrina (1995),Comedy|Romance
7,8,Tom and Huck (1995),Adventure|Children's
8,9,Sudden Death (1995),Action
9,10,GoldenEye (1995),Action|Adventure|Thriller


### Merge data into a single DataFrame

In [8]:
data = pd.merge(pd.merge(ratings, users), movies)

In [9]:
data.head() #head() displays only the first lines for viewing

Unnamed: 0,user_id,movie_id,rating,timestamp,gender,age,occupation,zip,title,genres
0,1,1193,5,978300760,F,1,10,48067,One Flew Over the Cuckoo's Nest (1975),Drama
1,2,1193,5,978298413,M,56,16,70072,One Flew Over the Cuckoo's Nest (1975),Drama
2,12,1193,4,978220179,M,25,12,32793,One Flew Over the Cuckoo's Nest (1975),Drama
3,15,1193,4,978199279,M,25,7,22903,One Flew Over the Cuckoo's Nest (1975),Drama
4,17,1193,5,978158471,M,50,1,95350,One Flew Over the Cuckoo's Nest (1975),Drama


In [10]:
len(data)

1000209
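When called without `on=`, `pd.merge` joins on the columns that the two tables have in common (`user_id`, then `movie_id`) and performs an inner join. The sketch below spells the keys out on small made-up tables and adds a length check. The `validate` argument and the toy values are additions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the three tables (values are made up).
ratings = pd.DataFrame({'user_id': [1, 1, 2],
                        'movie_id': [10, 20, 10],
                        'rating': [5, 3, 4]})
users = pd.DataFrame({'user_id': [1, 2], 'gender': ['F', 'M']})
movies = pd.DataFrame({'movie_id': [10, 20], 'title': ['A (1990)', 'B (1995)']})

# Same joins as pd.merge(pd.merge(ratings, users), movies), with keys explicit.
# validate='many_to_one' raises if a key is duplicated in the right-hand table.
data = (ratings
        .merge(users, on='user_id', how='inner', validate='many_to_one')
        .merge(movies, on='movie_id', how='inner', validate='many_to_one'))

# An inner join silently drops ratings whose user or movie is missing,
# so comparing lengths before and after is a cheap sanity check.
print(len(ratings), len(data))
```

If the two lengths differ on real data, some ratings refer to users or films that are absent from the other tables.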

# Let's explore!

### Question 0: The basics 

How many ratings are in the database? Are the raters predominantly male or female? What are the mean and the variance of the age?

Write a function that prints the sum and the average of a `pandas.Series` of type `bool`.

In [72]:
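def sum_and_mean(dbool):
    """
    Print the number and the percentage of True values in a boolean Series.
    """
    # On a boolean Series, sum() counts the True values and mean() gives their share.
    print('Sum: %d ; Percentage: %.2f' % (dbool.sum(), 100 * dbool.mean()))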
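One possible way to approach the basic counts is sketched below on a made-up table. It illustrates a point that is easy to miss: a statistic computed on the merged `data` counts each user once per rating, while the same statistic on `users` counts each user once. Note also that, per the MovieLens 1M README, `age` is a bracket code (for example `1` stands for "under 18") rather than an exact age, so its mean and variance should be read with care.

```python
import pandas as pd

# Toy stand-ins for `users` and the merged `data` (values are made up).
users = pd.DataFrame({'user_id': [1, 2, 3],
                      'gender': ['F', 'M', 'M'],
                      'age': [25, 35, 45]})
ratings = pd.DataFrame({'user_id': [1, 2, 2, 3, 3, 3],
                        'rating': [5, 4, 3, 2, 5, 4]})
data = ratings.merge(users, on='user_id')

n_ratings = len(data)                                  # number of ratings
share_male_users = (users['gender'] == 'M').mean()     # each user counted once
share_male_ratings = (data['gender'] == 'M').mean()    # weighted by activity
mean_age = users['age'].mean()
var_age = users['age'].var()                           # sample variance (ddof=1)

print(n_ratings, share_male_users, share_male_ratings, mean_age, var_age)
```

Here two thirds of the users are men, but they account for five sixths of the ratings, so the answer to "predominantly male or female?" depends on which table is used.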

### Question 1: Boolean conditions

How many films have a rating above 4.5? Is there a difference between men and women?

It's best to look at the proportions to get the right answer:

Check that the two computations agree.
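The comparison between counts and proportions can be sketched as follows, on a made-up table in which men simply rate more often than women. Since the ratings shown in the previews are integers, "above 4.5" amounts to a rating of 5. The column names match the notebook; the values are invented.

```python
import pandas as pd

data = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'M', 'M'],
    'rating': [5, 4, 5, 5, 3, 2],
})

high = data['rating'] > 4.5   # boolean Series: True where the rating is 5

count_by_gender = high.groupby(data['gender']).sum()    # raw counts
share_by_gender = high.groupby(data['gender']).mean()   # proportions

# Same proportion for men, computed by filtering the rows first.
share_m_check = (data.loc[data['gender'] == 'M', 'rating'] > 4.5).mean()

print(count_by_gender)
print(share_by_gender)
```

In this toy case men give twice as many top ratings as women, yet the proportion is 0.5 for both groups: the raw counts only reflect how many ratings each group contributed.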

### Question 2: Boolean operations and grouping

How many films have a median rating above 4.5 among men over 30? And among women over 30?
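A minimal sketch of combining boolean conditions with a group-wise median is given below. The helper `n_films_high_median` and the toy values are hypothetical; only the column names come from the notebook.

```python
import pandas as pd

data = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
    'gender': ['M', 'M', 'M', 'M', 'M', 'M', 'F', 'F'],
    'age':    [35, 45, 25, 35, 50, 56, 35, 45],
    'rating': [5, 5, 1, 4, 3, 5, 5, 5],
})

def n_films_high_median(df, gender, min_age=30, cutoff=4.5):
    """Count films whose median rating exceeds `cutoff` in the selected group."""
    # & is the element-wise AND; each condition needs its own parentheses.
    mask = (df['gender'] == gender) & (df['age'] > min_age)
    medians = df[mask].groupby('title')['rating'].median()
    return int((medians > cutoff).sum())

print(n_films_high_median(data, 'M'), n_films_high_median(data, 'F'))
```

Python's `and` would not work here because it cannot be applied element by element to a Series; `&`, `|` and `~` are the operators to use.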

### Question 3: Be careful with data size

What are the best-rated films?

Are they really the most popular? 

We will define a minimum popularity threshold (based on the number of ratings obtained), and only keep films that are above this threshold. 

What is the film most often rated by users? 
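The effect of a popularity threshold can be sketched on a made-up table as follows. The threshold value is hypothetical and deliberately tiny because the toy table is tiny.

```python
import pandas as pd

data = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'A', 'B', 'C', 'C', 'C'],
    'rating': [4, 5, 4, 3, 5, 5, 4, 5],
})
threshold_pop = 2  # hypothetical threshold, small because the table is small

# Mean rating and number of ratings for each film.
stats = data.groupby('title')['rating'].agg(['mean', 'count'])

best_overall = stats['mean'].idxmax()              # may rest on a single rating
popular = stats[stats['count'] > threshold_pop]    # keep sufficiently rated films
best_popular = popular['mean'].idxmax()
most_rated = stats['count'].idxmax()

print(best_overall, best_popular, most_rated)
```

Film `B` tops the unfiltered ranking with a single rating of 5. Once the threshold is applied, `C` comes first, which is why the sample size behind each mean matters.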

# Data visualization

In [41]:
# command to view figures in the notebook
%matplotlib inline 

### Question 4a: Histogram

Display the histogram of all film ratings.
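Because ratings take only the integer values 1 to 5, the default bins of a histogram can place bars at awkward positions. One possible approach, sketched here on made-up ratings, is to put the bin edges at half-integers so that each bar covers exactly one rating value. The `Agg` backend line is only there so that the sketch runs outside a notebook; it is not needed with `%matplotlib inline`.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy ratings (made up); on the real data this would be data['rating'].
ratings = pd.Series([5, 3, 3, 4, 5, 3, 5, 5, 4, 4, 1, 2])

# Bin edges at half-integers: 0.5, 1.5, ..., 5.5.
edges = np.arange(0.5, 6.0, 1.0)

fig, ax = plt.subplots()
counts, _, _ = ax.hist(ratings, bins=edges, rwidth=0.8)
ax.set_xlabel('rating')
ax.set_ylabel('number of ratings')
```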

### Question 4b: Histogram

Display the histogram of the number of ratings received by each film

### Question 5: Dependencies

Display the histogram of average film ratings.

Does the distribution of scores depend on gender?
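To compare two groups of different sizes, the rating distributions need to be normalized within each group. A minimal sketch with `pd.crosstab` on made-up values:

```python
import pandas as pd

data = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M', 'M', 'M', 'M', 'M'],
    'rating': [5, 4, 4, 3, 5, 5, 4, 4, 4, 3, 3, 1],
})

# Rows: gender; columns: rating value; cells: share of that gender's ratings.
# normalize='index' makes every row sum to 1.
dist = pd.crosstab(data['gender'], data['rating'], normalize='index')

print(dist)
```

Calling `dist.T.plot.bar()` would then draw the two distributions side by side, assuming matplotlib is available.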

### Question 6: Density estimation

Display the histogram of ratings for films rated more than threshold_pop = 30 times.

Now display the estimated density.

As a reminder, the density $f_d$ is such that $P(X<x) = \int_{-\infty}^{x} f_d(x') \, \mathrm{d}x'$
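A normalized histogram is the simplest density estimate: with `density=True`, the bar heights are rescaled so that the total area is 1, and the cumulative area approximates $P(X<x)$. The sketch below checks this numerically on simulated values that stand in for per-film mean ratings. The simulated distribution is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for per-film mean ratings (made-up values clipped to [1, 5]).
mean_ratings = np.clip(rng.normal(3.5, 0.6, size=1000), 1, 5)

# density=True rescales the bar heights so that the total area equals 1.
heights, edges = np.histogram(mean_ratings, bins=20, density=True)
widths = np.diff(edges)
area = np.sum(heights * widths)

# Cumulative area up to each right bin edge approximates P(X < x).
cdf = np.cumsum(heights * widths)

print(area, cdf[-1])
```

A smoother estimate can be obtained with a kernel density estimator, for instance `Series.plot.kde()` in pandas, which relies on SciPy.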

### Question 7: Scatter plot

Display a scatter plot of average male versus female ratings for each film (rated more than threshold_pop = 100 times).
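One possible way to build such a plot is to reshape the data with `pivot_table` so that each film has one column per gender. The sketch uses made-up values and a hypothetical, much smaller threshold. The `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

data = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'gender': ['F', 'M', 'M', 'F', 'F', 'M', 'M', 'M', 'M'],
    'rating': [5, 4, 4, 3, 2, 3, 5, 5, 4],
})
threshold_pop = 2  # hypothetical; the exercise uses 100 on the full data

# Keep the films rated more than threshold_pop times.
counts = data.groupby('title').size()
kept = counts.index[counts > threshold_pop]

# One row per film, one column per gender, cells = mean rating.
means = (data[data['title'].isin(kept)]
         .pivot_table(values='rating', index='title',
                      columns='gender', aggfunc='mean')
         .dropna())  # drop films rated by only one gender

fig, ax = plt.subplots()
ax.scatter(means['M'], means['F'])
ax.plot([1, 5], [1, 5], linestyle='--')  # diagonal: equal male and female means
ax.set_xlabel('mean rating (men)')
ax.set_ylabel('mean rating (women)')
```

Points far from the dashed diagonal are films that one group rates noticeably higher than the other.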

### Question 8: Anomalies

Display a scatter plot of average male versus female ratings for each film rated fewer than threshold_pop = 100 times.
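When a film has few ratings, the per-gender averages rest on very small samples, and large gaps can appear by chance alone. The simulation below is a sketch under a strong assumption: men and women draw their ratings from the same made-up distribution, so every gap it produces is pure noise. The function `spread_of_gap` and the probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([1, 2, 3, 4, 5])
probs = np.array([0.05, 0.10, 0.25, 0.35, 0.25])  # made-up rating distribution

def spread_of_gap(n_per_gender, n_films=2000):
    """Standard deviation of (male mean - female mean) across simulated films."""
    # Both groups draw from the SAME distribution, so any gap is pure noise.
    men = rng.choice(values, size=(n_films, n_per_gender), p=probs).mean(axis=1)
    women = rng.choice(values, size=(n_films, n_per_gender), p=probs).mean(axis=1)
    return float(np.std(men - women))

gap_small = spread_of_gap(5)     # films with 5 ratings per gender
gap_large = spread_of_gap(200)   # films with 200 ratings per gender

print(gap_small, gap_large)
```

The spread of the gap shrinks roughly like $1/\sqrt{n}$, so a wide cloud of points for rarely rated films does not by itself indicate a difference in behavior.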

### Open question: Interpretation
What disparity in behavior can be observed between men and women according to the last figure?

Is it really a disparity in behavior? 