<p style="font-family: Arial; font-size:3em;color:purple; font-weight:bold"><br>
Mini Project - Movie Ratings and Volatility</p><br>
<br>

The aim of this project is to answer the following question:

**Are movie ratings more biased at the time of release, and do they become less volatile, and thus potentially more reasonable, over time?**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
movies = pd.read_csv('../Week-4-Pandas/movielens/movies.csv', sep=',')

In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


We can indicate time of release by year. We need to extract the year from each movie's title and place it into a new column, 'year'. This is done using a regular expression and the Pandas `.str.extract()` method, as below. We also don't need the genres column, so let's take that out.

In [4]:
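# Capture the text inside the last pair of parentheses in each title.
# A raw string keeps the backslashes from being read as string escapes.
movies['year'] = movies['title'].str.extract(r'.*\((.*)\).*', expand=False)
del movies['genres']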

In [5]:
movies.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story (1995),1995
1,2,Jumanji (1995),1995
2,3,Grumpier Old Men (1995),1995
3,4,Waiting to Exhale (1995),1995
4,5,Father of the Bride Part II (1995),1995
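As a rough illustration of the extraction step, the sketch below applies the same pattern to a few hand-made titles. The sample titles and the stricter `\((\d{4})` pattern are assumptions for illustration, not part of the dataset or of the method used in this notebook. Because the leading `.*` is greedy, the notebook's pattern captures whatever sits inside the last pair of parentheses, so any stray characters in there come along too.

```python
import pandas as pd

# Hypothetical sample titles, chosen to show edge cases
titles = pd.Series([
    "Toy Story (1995)",
    "Seven (a.k.a. Se7en) (1995)",
    "Some Series (2007-)",   # made-up title with a trailing dash
    "No Year Here",          # made-up title with no parentheses
])

# The notebook's pattern: text inside the last pair of parentheses
loose = titles.str.extract(r'.*\((.*)\).*', expand=False)

# A stricter alternative (assumption): four digits right after "("
strict = titles.str.extract(r'\((\d{4})', expand=False)

print(pd.DataFrame({'title': titles, 'loose': loose, 'strict': strict}))
```

Titles with no match come back as missing values under either pattern, so it may be worth checking `movies['year'].isna().sum()` before filtering.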


In [6]:
ratings = pd.read_csv('../Week-4-Pandas/movielens/ratings.csv', sep=',')

In [7]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,1112486027
1,1,29,3.5,1112484676
2,1,32,3.5,1112484819
3,1,47,3.5,1112484727
4,1,50,3.5,1112484580


We want to convert our rating timestamps from Unix time (seconds since 1 Jan 1970) to human-readable time. This can be done with the Pandas `pd.to_datetime()` function. We will put the result into a new column, 'datetime'. To keep things tidy, we'll also remove the original 'timestamp' column.

In [8]:
ratings['datetime'] = pd.to_datetime(ratings['timestamp'], unit='s')
del ratings['timestamp']

In [9]:
ratings.head()

Unnamed: 0,userId,movieId,rating,datetime
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40
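To see why the `unit='s'` argument matters, here is a minimal sketch using two of the timestamps shown earlier. The variable names `stamps`, `converted` and `wrong` are illustrative only.

```python
import pandas as pd

# Timestamps taken from the first rows of ratings.csv shown earlier
stamps = pd.Series([1112486027, 1112484676])

# unit='s' tells pandas the integers count seconds since 1970-01-01 00:00 UTC
converted = pd.to_datetime(stamps, unit='s')
print(converted)

# Without unit='s', integers are read as nanoseconds since the epoch,
# which lands roughly one second after 1970-01-01
wrong = pd.to_datetime(stamps)
print(wrong)
```

The resulting values are timezone-naive and correspond to UTC, not to each user's local time.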


Using the `.min()` and `.max()` methods, we can see that our ratings data begins in January 1995 and ends in March 2015, so let's only consider movies released between 1995 and 2015.

In [10]:
print(ratings['datetime'].min())
print(ratings['datetime'].max())

1995-01-09 11:46:44
2015-03-31 06:40:02


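We create a boolean mask that keeps only movies released from 1995 up to, but not including, 2016. The 'year' column still holds strings, so the comparison is lexicographic, which agrees with numeric order for four-digit years.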

In [11]:
m1 = movies['year'] >= '1995'
m2 = movies['year'] < '2016'

fmovies = movies[m1 & m2]

movie_yrs = fmovies['year'].unique().tolist()
movie_yrs

['1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2009– ',
 '2007-',
 '2015']

Putting the unique values into a list shows that some values are problematic: they contain trailing dashes and spaces. Let's remove these.

In [12]:
# Build a new DataFrame rather than assigning into a filtered slice,
# which avoids pandas' SettingWithCopyWarning
fmovies = fmovies.assign(year=fmovies['year'].str.rstrip(' -–'))


In [13]:
# Parse the year strings as dates (1 January of each year);
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
fmovies = fmovies.assign(year=pd.to_datetime(fmovies['year']))


In [14]:
fmovies.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story (1995),1995-01-01
1,2,Jumanji (1995),1995-01-01
2,3,Grumpier Old Men (1995),1995-01-01
3,4,Waiting to Exhale (1995),1995-01-01
4,5,Father of the Bride Part II (1995),1995-01-01
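The filtering, cleaning and conversion steps can also be sketched on a small hand-made frame. The toy data and the names `toy`, `mask` and `recent` are assumptions for illustration. Taking an explicit `.copy()` of the filtered frame is one common way to avoid `SettingWithCopyWarning` when its columns are modified afterwards.

```python
import pandas as pd

# Made-up rows that mimic the problematic year strings
toy = pd.DataFrame({
    'title': ['A (1994)', 'B (1995)', 'C (2009– )', 'D (2007-)', 'E (2016)'],
    'year':  ['1994', '1995', '2009– ', '2007-', '2016'],
})

# String comparison is lexicographic; it matches numeric order here
# because every value starts with a four-digit year
mask = (toy['year'] >= '1995') & (toy['year'] < '2016')

# .copy() makes the result an independent DataFrame
recent = toy[mask].copy()

# Strip trailing spaces, hyphens and en dashes, then parse as dates
recent['year'] = recent['year'].str.rstrip(' -–')
recent['year'] = pd.to_datetime(recent['year'], format='%Y')

print(recent)
```

Passing `format='%Y'` is a choice made for this sketch: it makes the parse fail loudly if any value is not a plain four-digit year.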


Now let's merge our two dataframes into one on movieId.

In [15]:
mvrt = ratings.merge(fmovies, on='movieId', how='inner')

In [16]:
mvrt.head()

Unnamed: 0,userId,movieId,rating,datetime,title,year
0,1,2,3.5,2005-04-02 23:53:47,Jumanji (1995),1995-01-01
1,5,2,3.0,1996-12-25 15:26:09,Jumanji (1995),1995-01-01
2,13,2,3.0,1996-11-27 08:19:02,Jumanji (1995),1995-01-01
3,29,2,3.0,1996-06-23 20:36:14,Jumanji (1995),1995-01-01
4,34,2,3.0,1996-10-28 13:29:44,Jumanji (1995),1995-01-01
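A small sketch of what an inner merge keeps and drops, using made-up frames (`toy_ratings` and `toy_movies` are hypothetical names and data). Since `fmovies` only holds movies from the filtered year range, an inner merge should likewise drop every rating whose movie falls outside that range.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3],
    'movieId': [10, 10, 99],   # movieId 99 has no matching movie
    'rating':  [3.5, 4.0, 2.0],
})
toy_movies = pd.DataFrame({
    'movieId': [10, 20],       # movieId 20 has no ratings
    'title':   ['Movie A (1995)', 'Movie B (1996)'],
})

# how='inner' keeps only movieIds present in both frames
merged = toy_ratings.merge(toy_movies, on='movieId', how='inner')
print(merged)
```

Comparing `len(ratings)` with `len(mvrt)` is a quick way to see how many ratings the merge removed.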


In [17]:
# Sanity check: adding a one-year offset to the first release date
print(mvrt['year'][0] + pd.DateOffset(years=1))

1996-01-01 00:00:00


In [19]:
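# Keep only ratings made before the end of each movie's release year,
# i.e. before 1 January of the following year
mask1 = mvrt['datetime'] < (mvrt['year'] + pd.DateOffset(years=1))

mvrt[mask1]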

Unnamed: 0,userId,movieId,rating,datetime,title,year
116714,131160,47,5.0,1995-01-09 11:46:49,Seven (a.k.a. Se7en) (1995),1995-01-01
190508,29,653,3.0,1996-06-23 20:44:56,Dragonheart (1996),1996-01-01
190511,75,653,4.0,1996-07-02 13:04:16,Dragonheart (1996),1996-01-01
190520,149,653,4.0,1996-11-26 15:19:23,Dragonheart (1996),1996-01-01
190523,158,653,3.0,1996-09-14 21:06:52,Dragonheart (1996),1996-01-01
190524,159,653,3.0,1996-12-16 17:48:43,Dragonheart (1996),1996-01-01
190527,199,653,3.0,1996-12-17 20:38:49,Dragonheart (1996),1996-01-01
190541,324,653,3.0,1996-12-13 16:23:37,Dragonheart (1996),1996-01-01
190542,325,653,4.0,1996-11-30 21:54:01,Dragonheart (1996),1996-01-01
190587,760,653,4.0,1996-06-06 13:04:45,Dragonheart (1996),1996-01-01
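The release-year mask can be checked on a few made-up rows. The frame `toy` and the names `cutoff` and `first_year` are illustrative assumptions. Note that a rating made exactly at midnight on 1 January of the following year is excluded, because the comparison is strict.

```python
import pandas as pd

# Three hypothetical ratings of a movie released in 1996
toy = pd.DataFrame({
    'rating':   [5.0, 3.0, 4.0],
    'datetime': pd.to_datetime(['1996-06-23', '1997-01-01', '2005-04-02']),
    'year':     pd.to_datetime(['1996-01-01'] * 3),
})

# End of the release year: 1 January of the following year
cutoff = toy['year'] + pd.DateOffset(years=1)

# True where the rating was made before that cutoff
first_year = toy['datetime'] < cutoff
print(toy[first_year])
```

Because 'year' is stored as 1 January, this window covers the calendar year of release, not twelve months from the actual release date.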
