# Background of Problem Statement

The GroupLens Research Project is a research group in the Department of Computer Science and Engineering at the University of Minnesota. Its members work on many research projects in information filtering, collaborative filtering, and recommender systems. The project is led by professors John Riedl and Joseph Konstan. It began exploring automated collaborative filtering in 1992 but is best known for its worldwide trial of an automated collaborative filtering system for Usenet news in 1996. Since then the project has expanded its scope to research overall information filtering solutions, integrating content-based methods as well as improving current collaborative filtering technology.

# Problem Objective
    
Here, we ask you to perform the analysis using the Exploratory Data Analysis technique. You need to find features affecting the ratings of any particular movie and build a model to predict the movie ratings.

Analysis Tasks to be performed:

1. Import the three datasets

2. Create a new dataset with the following columns: MovieID, Title, UserID, Age, Gender, Occupation, Rating. (Hint: (i) Merge two tables at a time. (ii) Merge the tables using the two primary keys MovieID and UserID.)

3. Explore the datasets using visual representations (graphs or tables).

4. Also include your comments on the following:
    a. User Age Distribution

    b. User rating of the movie “Toy Story”

    c. Top 25 movies by viewership rating

    d. Find the ratings for all the movies reviewed by a particular user (user id = 2696)
    
5. Feature Engineering:
            
    Use column genres:

    a. Find out all the unique genres (Hint: split the data in column genre making a list and then process the data to find out only the unique categories of genres)

    b. Create a separate column for each genre category with a one-hot encoding ( 1 and 0) whether or not the movie belongs to that genre. 
        
    c. Determine the features affecting the ratings of any particular movie.
    
    d. Develop an appropriate model to predict the movie ratings

# Ratings.dat

Format - UserID::MovieID::Rating::Timestamp

Field - Description

UserID - Unique identification for each user

MovieID - Unique identification for each movie

Rating - User rating for each movie

Timestamp - Timestamp generated while adding user review

# Users.dat
Format - UserID::Gender::Age::Occupation::Zip-code

Field - Description

UserID - Unique identification for each user

Gender - Gender of the user

Age - User’s age

Occupation - User’s Occupation

Zip-code - Zip Code for the user’s location

# Movies.dat

Format - MovieID::Title::Genres

Field - Description

MovieID - Unique identification for each movie

Title - A title for each movie

Genres - Category of each movie
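The `::`-separated layout described above can be tried out on a few rows before loading the full files. The snippet below is a minimal sketch: the two sample rows and the name `movies_sample` are illustrative, not taken from the notebook's data files.

```python
import io
import pandas as pd

# Two illustrative rows in the Movies.dat layout (MovieID::Title::Genres)
sample = io.StringIO(
    "1::Toy Story (1995)::Animation|Children's|Comedy\n"
    "2::Jumanji (1995)::Adventure|Children's|Fantasy\n"
)

# A multi-character separator is treated as a regular expression,
# which only the Python parsing engine supports
movies_sample = pd.read_csv(
    sample,
    sep="::",
    engine="python",
    names=["MovieID", "Title", "Genres"],
)
print(movies_sample)
```

Passing `engine="python"` explicitly avoids the warning pandas emits when it has to fall back from the default C engine.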

In [None]:
#Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

from datetime import datetime

In [None]:
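# Load the datasets into different dataframes
# sep='::' is a multi-character separator, so the Python parsing engine is required
# encoding='latin-1' for movies.dat is an assumption: the MovieLens 1M titles contain non-UTF-8 characters
df_movies = pd.read_csv('movies.dat', sep='::', engine='python', encoding='latin-1', names=['MovieID','Title','Genres'])
df_users = pd.read_csv('users.dat', sep='::', engine='python', names=['UserID','Gender','Age','Occupation','Zip'])
df_ratings = pd.read_csv('ratings.dat', sep='::', engine='python', names=['UserID','MovieID','Rating','Timestamp'])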

In [None]:
df_movies.head()

In [None]:
df_users.head()

In [None]:
df_ratings.head()

In [None]:
# Join all the dataframes based on common columns to create the final dataframe
final_df = pd.merge(df_ratings, df_users, how = 'inner', on=['UserID'])
final_df = pd.merge(final_df, df_movies, how = 'inner', on=['MovieID'])

In [None]:
final_df.head()

In [None]:
len(final_df)

In [None]:
# Convert the Unix timestamp (seconds since the epoch) to datetime values; the result is in UTC
final_df['Timestamp'] = pd.to_datetime(final_df['Timestamp'], unit='s')

In [None]:
final_df.head(5)

###### Find the Unique Genres

In [None]:
# Function to find unique genres
def find_uniq_genre(data):
    genre_list=[]
    for row in data:
        row_list = row.split(sep='|')
        genre_list = genre_list + row_list
    return set(genre_list)

# Call the function to find the list of unique genres
uniq_genres = find_uniq_genre(final_df['Genres'].unique())
print(uniq_genres)
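
As an alternative to the explicit loop, the unique genres can be collected with pandas string methods. The sketch below uses a small made-up `genres` series rather than the notebook's `final_df['Genres']` column.

```python
import pandas as pd

# Hypothetical sample of pipe-separated genre strings
genres = pd.Series(["Animation|Comedy", "Comedy|Drama", "Drama"])

# Split each string into a list, then explode to one genre per row
unique_genres = sorted(genres.str.split("|").explode().unique())
print(unique_genres)
```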

###### Create the genre columns and fill with 1 and 0 for all the rows

In [None]:
for genre in uniq_genres:
    final_df[genre] = np.nan

In [None]:
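# DataFrame.set_value was removed in pandas 1.0, so build each 0/1 column directly
for genre in uniq_genres:
    # 1 if the genre appears in the row's pipe-separated list, otherwise 0
    final_df[genre] = final_df['Genres'].apply(lambda g: int(genre in g.split('|')))
final_df.head()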
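
pandas also offers a one-call route to the same 0/1 genre columns through `Series.str.get_dummies`. The following is a sketch on a hypothetical two-row frame (`movies_sample`), not the notebook's own data.

```python
import pandas as pd

# Hypothetical sample frame with pipe-separated genres
movies_sample = pd.DataFrame({"Title": ["A", "B"], "Genres": ["Animation|Comedy", "Drama"]})

# One 0/1 column per distinct genre found in the column
genre_flags = movies_sample["Genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the original frame
encoded = pd.concat([movies_sample, genre_flags], axis=1)
print(encoded)
```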

###### User Age Distribution

In [None]:
df_users.groupby('Age')['UserID'].count()

In [None]:
df_users.groupby('Age')['UserID'].count().plot(kind = 'bar', color = 'blue',figsize = (6,4))
plt.xlabel('Age')
plt.ylabel('Number of Users')
plt.title('User Age Distribution')
plt.xticks(rotation=0)
plt.show()
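
If the files are the MovieLens 1M release (an assumption, since the notebook does not name the dataset version), the `Age` column holds bucket codes rather than exact ages. The sketch below maps those codes to the ranges given in that release's README; `users_sample` is a made-up frame.

```python
import pandas as pd

# Age codes as documented in the MovieLens 1M README (assumed dataset)
AGE_LABELS = {
    1: "Under 18", 18: "18-24", 25: "25-34", 35: "35-44",
    45: "45-49", 50: "50-55", 56: "56+",
}

# Hypothetical sample of users
users_sample = pd.DataFrame({"UserID": [1, 2, 3, 4], "Age": [1, 25, 25, 56]})

# Count users per labelled age range
age_counts = users_sample["Age"].map(AGE_LABELS).value_counts()
print(age_counts)
```

Labelled ranges make the bar chart easier to read than the raw codes on the x-axis.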

###### Average User rating of the movie “Toy Story”

In [None]:
#No of user ratings against each rating for movie Toy Story (1995)
final_df[final_df.Title == 'Toy Story (1995)'].groupby('Rating')['UserID'].count()

In [None]:
final_df[final_df.Title == 'Toy Story (1995)'].groupby('Rating')['UserID'].count().\
    plot(kind = 'bar', color = 'blue',figsize = (6,5))
plt.xlabel('Rating')
plt.ylabel('No of Users')
plt.title('Distribution of User Rating of Toy Story')
plt.xticks(rotation=0)
plt.show()

In [None]:
# Mean rating for the movie Toy Story (1995)
round(final_df['Rating'][final_df['Title'] == 'Toy Story (1995)'].mean(),2)

###### Top 25 movies by viewership rating

In [None]:
# List the 25 movies with the largest number of ratings.
final_df.groupby(['MovieID','Title'])['Rating'].count().sort_values(ascending=False).head(25)

In [None]:
# Drop movies having fewer than 100 ratings
no_of_ratings = 100

# Number of ratings received by each row's movie, aligned to final_df's index
ratings_per_movie = final_df.groupby('MovieID')['Rating'].transform('count')
df = final_df[ratings_per_movie >= no_of_ratings]

In [None]:
# Find the top 25 movies by average rating among those with at least 100 ratings
df.groupby(['Title'])['Rating'].mean().sort_values(ascending=False).head(25)
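
The filter-then-rank step can also be written as a single aggregation that keeps the rating count next to the mean. This is a sketch on made-up data; `min_ratings = 2` is a hypothetical threshold chosen to suit the tiny sample, where the notebook uses 100.

```python
import pandas as pd

# Hypothetical ratings for three movies
ratings = pd.DataFrame({
    "Title": ["A", "A", "A", "B", "C", "C"],
    "Rating": [5, 4, 3, 5, 2, 4],
})
min_ratings = 2  # hypothetical threshold for this small sample

# Count and mean per title in one pass
stats = ratings.groupby("Title")["Rating"].agg(["count", "mean"])

# Keep sufficiently rated titles, best average first
top = stats[stats["count"] >= min_ratings].sort_values("mean", ascending=False)
print(top)
```

Keeping the count column in the result shows how many ratings support each average.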

###### Find the ratings for all the movies reviewed by a particular user (user id = 2696)

In [None]:
#list ratings by User 2696
final_df[['MovieID','Title','Age','Gender','Occupation','Rating']][final_df['UserID']==2696]

In [None]:
# For user 2696, Count of different ratings 
final_df[final_df.UserID == 2696].groupby('Rating')['MovieID'].count()

In [None]:
final_df[final_df.UserID == 2696].groupby('Rating')['MovieID'].count().plot(kind = 'barh', color = 'blue',figsize = (6,5))
# In a horizontal bar chart the counts run along the x-axis and the ratings along the y-axis
plt.xlabel('No of Ratings')
plt.ylabel('Rating')
plt.title('Movie Ratings of UserID 2696')
plt.show()

In [None]:
final_df = final_df.fillna(0)
final_df.head()

In [None]:
# Encode the Gender Column
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder() 
final_df['Gender']= label_encoder.fit_transform(final_df['Gender'])

In [None]:
final_df['RDate'] = pd.to_datetime(final_df['Timestamp']).dt.date

In [None]:
# Drop the columns which will not impact the prediction
final_df = final_df.drop(['Title','Timestamp','Genres','MovieID', 'UserID', 'RDate','Zip' ], axis=1)

In [None]:
final_df.head()

In [None]:
# Use every remaining column except the target as a feature
final_df_features = final_df.drop(columns=['Rating'])
final_df_label = final_df.Rating

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(final_df_features, final_df_label, test_size=0.3,random_state=42)                                                   

###### Build a random forest classifier model

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

In [None]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
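
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent rating; `y_test_sample` is a made-up label series standing in for the notebook's `y_test`.

```python
import pandas as pd

# Hypothetical test labels (ratings on a 1-5 scale)
y_test_sample = pd.Series([4, 4, 3, 5, 4, 2, 3, 4])

# Share of the most common class = accuracy of always predicting that class
baseline_accuracy = y_test_sample.value_counts(normalize=True).max()
print("Majority-class baseline accuracy:", baseline_accuracy)
```

A classifier whose accuracy is close to this baseline has learned little beyond the overall rating distribution.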

###### Find the Important Features which influence the rating

In [None]:
# Label each importance with its feature name so the ranking is readable
feature_imp = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_imp
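
To see how labelled importances read, here is a sketch on synthetic data in which one column fully determines the label and the other is random. The column names `signal` and `noise` and the model settings are illustrative assumptions, not the notebook's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.integers(0, 2, size=200),
    "noise": rng.integers(0, 2, size=200),
})
y = X["signal"]  # the label is fully determined by "signal"

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Index the importances by column name and rank them
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

The impurity-based importances of a fitted forest are normalized to sum to 1, so each value can be read as a share of the total.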

In [None]:
# final_df = final_df[:1000]
# len(final_df)

In [None]:
# Keep the three most important features (Occupation, Age and Gender) and rebuild the model
final_df_features = final_df[['Occupation','Age','Gender']]
final_df_label = final_df.Rating

In [None]:
X_train, X_test, y_train, y_test = train_test_split(final_df_features, final_df_label, test_size=0.3,random_state=42)

In [None]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

###### Build a Multi-Class Classification Model: Naive Bayes Classifier

In [None]:
# from sklearn.svm import SVC

# svc = SVC()
# svc.fit(X_train, y_train)
# y_pred = svc.predict(X_test)

# print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
from sklearn.naive_bayes import GaussianNB

gaussian = GaussianNB()

gaussian.fit(X_train, y_train)
y_pred = gaussian.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))