# Movie Recommendation System

By: Anat Jacobson, Stephanie Ciaccia and Charlie Jin

---

### Overview: 

Build a model that provides the top 5 movie recommendations to a user, based on their ratings of other movies.

The MovieLens dataset is a "classic" recommendation system dataset, used in numerous academic papers and machine learning proofs of concept. You will need to define how the user provides their ratings of other movies, and formulate a more specific business problem within the general context of "recommending movies".

### Introduction and Business Case:

### Business Understandings:
#### Applications:
#### Data: 
### Methodology:

---

## Data Exploration and Analysis
Importing all packages needed

In [202]:
#imports
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.metrics import log_loss
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from imblearn.pipeline import Pipeline as ImPipeline

### Import CSVs, Merge, and Clean all Data Needed
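Four datasets are imported:
- Movies: `movieId`, `title`, `genres` (9,742 rows)
- Links: `movieId`, `imdbId`, `tmdbId` (9,742 rows)
- Ratings: `userId`, `movieId`, `rating`, `timestamp` (100,836 rows)
- Tags: `userId`, `movieId`, `tag`, `timestamp` (3,683 rows)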


In [203]:
#Import all relevant data tables
movies_df = pd.read_csv('data/movies.csv')
links_df = pd.read_csv('data/links.csv')
ratings_df = pd.read_csv('data/ratings.csv')
tags_df = pd.read_csv('data/tags.csv')

In [204]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [205]:
movies_df.shape

(9742, 3)

In [206]:
links_df.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [207]:
links_df.shape

(9742, 3)

In [208]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [209]:
ratings_df.shape

(100836, 4)

In [210]:
tags_df.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [211]:
tags_df.shape

(3683, 4)

In [212]:
# Merge links into movies (one row per movie)
links_and_movies = pd.merge(movies_df, links_df, how='left', on='movieId')
links_and_movies

Unnamed: 0,movieId,title,genres,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,113041,11862.0
...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,5476944,432131.0
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,5914996,445030.0
9739,193585,Flint (2017),Drama,6397426,479308.0
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,8391976,483455.0


In [213]:
# Merge ratings onto the movie table (one row per rating; unrated movies keep NaN ratings)
movies_and_ratings = pd.merge(links_and_movies, ratings_df, how='left', on='movieId')
movies_and_ratings

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,7.0,4.5,1.106636e+09
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,15.0,2.5,1.510578e+09
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,17.0,4.5,1.305696e+09
...,...,...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,5476944,432131.0,184.0,4.0,1.537109e+09
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,5914996,445030.0,184.0,3.5,1.537110e+09
100851,193585,Flint (2017),Drama,6397426,479308.0,184.0,3.5,1.537110e+09
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,8391976,483455.0,184.0,3.5,1.537110e+09


In [214]:
# Merge tags on movieId only, so each rating row is repeated once per tag of that movie
full_df = pd.merge(movies_and_ratings, tags_df, how='left', on='movieId')
full_df.head()

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,336.0,pixar,1139046000.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,474.0,pixar,1137207000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,567.0,fun,1525286000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847434962.0,336.0,pixar,1139046000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847434962.0,474.0,pixar,1137207000.0


In [215]:
full_df.shape

(285783, 11)
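
The jump from 100,836 ratings to 285,783 rows is consistent with a merge keyed on `movieId` alone: every rating of a movie gets paired with every tag of that movie. The sketch below reproduces the effect on toy data (the frames and values are illustrative stand-ins, not the real tables) and shows one possible alternative, collapsing tags to one row per movie before merging. Whether that aggregation suits the final model is an assumption, not something the notebook settles.

```python
import pandas as pd

# Toy stand-ins for ratings_df and tags_df (values are illustrative only).
ratings = pd.DataFrame({
    "movieId": [1, 1, 2],
    "userId": [10, 11, 10],
    "rating": [4.0, 3.5, 5.0],
})
tags = pd.DataFrame({
    "movieId": [1, 1, 1],
    "userId": [20, 21, 22],
    "tag": ["pixar", "pixar", "fun"],
})

# Merging on movieId alone pairs every rating with every tag of that movie:
# 2 ratings x 3 tags for movie 1, plus 1 untagged row for movie 2.
fanned_out = ratings.merge(tags, how="left", on="movieId")
print(len(fanned_out))

# Alternative: collapse tags to one row per movie first, then merge.
tags_per_movie = (
    tags.groupby("movieId")["tag"]
        .agg(lambda s: "|".join(sorted(set(s))))
        .reset_index()
)
# validate raises if the right-hand side is not unique on movieId.
one_row_per_rating = ratings.merge(
    tags_per_movie, how="left", on="movieId", validate="many_to_one"
)
print(len(one_row_per_rating))
```

With the aggregated version the row count stays equal to the number of ratings, so statistics such as the mean rating are not weighted by how many tags a movie happens to have.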

In [216]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 285783 entries, 0 to 285782
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      285783 non-null  int64  
 1   title        285783 non-null  object 
 2   genres       285783 non-null  object 
 3   imdbId       285783 non-null  int64  
 4   tmdbId       285770 non-null  float64
 5   userId_x     285762 non-null  float64
 6   rating       285762 non-null  float64
 7   timestamp_x  285762 non-null  float64
 8   userId_y     233234 non-null  float64
 9   tag          233234 non-null  object 
 10  timestamp_y  233234 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 26.2+ MB


In [217]:
# Drop every row that has no tag (userId_y is NA).
# The alternative would be to drop the tag columns instead of the rows.
full_df.drop(full_df[full_df['userId_y'].isna()].index, inplace = True)

In [218]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233234 entries, 0 to 285773
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      233234 non-null  int64  
 1   title        233234 non-null  object 
 2   genres       233234 non-null  object 
 3   imdbId       233234 non-null  int64  
 4   tmdbId       233234 non-null  float64
 5   userId_x     233213 non-null  float64
 6   rating       233213 non-null  float64
 7   timestamp_x  233213 non-null  float64
 8   userId_y     233234 non-null  float64
 9   tag          233234 non-null  object 
 10  timestamp_y  233234 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 21.4+ MB


In [219]:
# Drop the remaining rows that have no rating
full_df.drop(full_df[full_df['rating'].isna()].index, inplace = True)

In [220]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 233213 entries, 0 to 285773
Data columns (total 11 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   movieId      233213 non-null  int64  
 1   title        233213 non-null  object 
 2   genres       233213 non-null  object 
 3   imdbId       233213 non-null  int64  
 4   tmdbId       233213 non-null  float64
 5   userId_x     233213 non-null  float64
 6   rating       233213 non-null  float64
 7   timestamp_x  233213 non-null  float64
 8   userId_y     233213 non-null  float64
 9   tag          233213 non-null  object 
 10  timestamp_y  233213 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 21.4+ MB


In [221]:
full_df.describe() #exploring that dataset

Unnamed: 0,movieId,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,timestamp_y
count,233213.0,233213.0,233213.0,233213.0,233213.0,233213.0,233213.0,233213.0
mean,12319.999443,261063.2,9378.277742,309.688191,3.966535,1213524000.0,470.683564,1384774000.0
std,28243.919401,441441.1,36943.1398,178.206387,0.968637,225044800.0,153.329632,153462100.0
min,1.0,12349.0,11.0,1.0,0.5,828124600.0,2.0,1137179000.0
25%,296.0,110357.0,278.0,156.0,3.5,1017365000.0,424.0,1242494000.0
50%,1198.0,110912.0,680.0,309.0,4.0,1217325000.0,477.0,1457901000.0
75%,4638.0,172495.0,1892.0,460.0,5.0,1443201000.0,599.0,1498457000.0
max,193565.0,5580390.0,503475.0,610.0,5.0,1537799000.0,610.0,1537099000.0


In [222]:
full_df.head() 

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,336.0,pixar,1139046000.0
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,474.0,pixar,1137207000.0
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,964982703.0,567.0,fun,1525286000.0
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847434962.0,336.0,pixar,1139046000.0
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,847434962.0,474.0,pixar,1137207000.0


What do we have?
- Movies, each with a set of features.
- The features are:
    - Genre
    - Rating
    - Tag

Each movie should probably occupy only one row, so that each row is a feature vector for that movie.
Two movies can then be compared by the cosine of the angle between their vectors: the closer the value is to 1, the more similar the movies are.
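
As a minimal sketch of the cosine comparison described above, the snippet below scores pairs of movies from genre indicator vectors. The helper `cosine_similarity` and the six-genre column order are assumptions made for illustration; the genre memberships come from the `movies_df` rows shown earlier.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return float(a @ b / denom)

# Column order (assumed): Adventure, Animation, Children, Comedy, Fantasy, Romance
toy_story = [1, 1, 1, 1, 1, 0]
jumanji = [1, 0, 1, 0, 1, 0]
grumpier_old_men = [0, 0, 0, 1, 0, 1]

sim_toy_jumanji = cosine_similarity(toy_story, jumanji)
sim_toy_grumpier = cosine_similarity(toy_story, grumpier_old_men)
print(round(sim_toy_jumanji, 3), round(sim_toy_grumpier, 3))
```

Toy Story shares three genres with Jumanji and only one with Grumpier Old Men, so the first score comes out higher than the second.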


Next step: one-hot encode (or create dummy variables for) `genres` and `tag`.
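
One way to get the genre indicators, sketched here on a three-row stand-in for `movies_df`, is `Series.str.get_dummies`, which splits the pipe-separated string and one-hot encodes it in a single step. The names `movies`, `genre_dummies` and `movie_features` are illustrative; the same idea could be applied to tags after grouping them per movie.

```python
import pandas as pd

# Small stand-in for movies_df, using rows shown earlier in the notebook.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre; columns come back in alphabetical order.
genre_dummies = movies["genres"].str.get_dummies(sep="|")

# One row per movie: the ID followed by its genre indicator vector.
movie_features = pd.concat([movies[["movieId"]], genre_dummies], axis=1)
print(movie_features)
```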

In [224]:
#subpipe_numeric = Pipeline(steps=[('ss', StandardScaler())])
#subpipe_categorical = Pipeline(steps=[('ohe', OneHotEncoder(sparse=False, handle_unknown= 'ignore'))])
#CT = ColumnTransformer(transformers=[('subpipe_numeric', subpipe_numeric, scale),
#                                         ('subpipe_categorical', subpipe_categorical, cat)],
#                           remainder='passthrough')

In [225]:
ratings_df['userId'].min()

1

In [226]:
ratings_df['userId'].max()  # use ratings_df as the reference for user IDs

610

In [227]:
ratings_df  # TODO: reshape this dataframe. Ratings is probably the main table, keyed by user ID

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [228]:
full_df['genres'].value_counts()

Comedy|Crime|Drama|Thriller           56851
Action|Crime|Drama|Thriller           16725
Action|Adventure|Sci-Fi               11939
Drama                                  6242
Comedy                                 5486
                                      ...  
Drama|Fantasy|Horror                      1
Animation|Children|Musical                1
Drama|Thriller|Western                    1
Adventure|Crime|Drama|Thriller            1
Comedy|Crime|Drama|Mystery|Romance        1
Name: genres, Length: 370, dtype: int64

In [229]:
genres = (full_df.genres.str.split('|', expand=True)
            .stack()
            .to_frame(name='genres'))

genres.index = genres.index.droplevel(1)

In [230]:
genres2 = full_df['genres'].str.split('|', expand=True)
genres2

Unnamed: 0,0,1,2,3,4,5,6
0,Adventure,Animation,Children,Comedy,Fantasy,,
1,Adventure,Animation,Children,Comedy,Fantasy,,
2,Adventure,Animation,Children,Comedy,Fantasy,,
3,Adventure,Animation,Children,Comedy,Fantasy,,
4,Adventure,Animation,Children,Comedy,Fantasy,,
...,...,...,...,...,...,...,...
285745,Action,Adventure,Children,Sci-Fi,,,
285770,Action,Animation,Comedy,Sci-Fi,,,
285771,Action,Animation,Comedy,Sci-Fi,,,
285772,Action,Animation,Comedy,Sci-Fi,,,


In [231]:
genres2[6].value_counts()

IMAX       3718
Mystery      97
Name: 6, dtype: int64

In [232]:
df = full_df.join(genres2)

In [233]:
df

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y,0,1,2,3,4,5,6
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,336.0,pixar,1.139046e+09,Adventure,Animation,Children,Comedy,Fantasy,,
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,474.0,pixar,1.137207e+09,Adventure,Animation,Children,Comedy,Fantasy,,
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,567.0,fun,1.525286e+09,Adventure,Animation,Children,Comedy,Fantasy,,
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08,336.0,pixar,1.139046e+09,Adventure,Animation,Children,Comedy,Fantasy,,
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08,474.0,pixar,1.137207e+09,Adventure,Animation,Children,Comedy,Fantasy,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285745,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,3778644,348350.0,586.0,5.0,1.529900e+09,62.0,star wars,1.528935e+09,Action,Adventure,Children,Sci-Fi,,,
285770,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,anime,1.537099e+09,Action,Animation,Comedy,Sci-Fi,,,
285771,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,comedy,1.537099e+09,Action,Animation,Comedy,Sci-Fi,,,
285772,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,gintama,1.537099e+09,Action,Animation,Comedy,Sci-Fi,,,


Give the release year its own column. This could be done with a loop, or with string slicing as below.

In [236]:
full_df['year'] = full_df['title'].str[-6:]
full_df

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y,year
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,336.0,pixar,1.139046e+09,(1995)
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,474.0,pixar,1.137207e+09,(1995)
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,567.0,fun,1.525286e+09,(1995)
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08,336.0,pixar,1.139046e+09,(1995)
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08,474.0,pixar,1.137207e+09,(1995)
...,...,...,...,...,...,...,...,...,...,...,...,...
285745,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,3778644,348350.0,586.0,5.0,1.529900e+09,62.0,star wars,1.528935e+09,(2018)
285770,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,anime,1.537099e+09,(2010)
285771,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,comedy,1.537099e+09,(2010)
285772,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,gintama,1.537099e+09,(2010)


In [237]:
# Title without the year; strip the space left in front of "(YYYY)"
full_df['real_title'] = full_df['title'].str[:-6].str.strip()
full_df

Unnamed: 0,movieId,title,genres,imdbId,tmdbId,userId_x,rating,timestamp_x,userId_y,tag,timestamp_y,year,real_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,336.0,pixar,1.139046e+09,(1995),Toy Story
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,474.0,pixar,1.137207e+09,(1995),Toy Story
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,1.0,4.0,9.649827e+08,567.0,fun,1.525286e+09,(1995),Toy Story
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08,336.0,pixar,1.139046e+09,(1995),Toy Story
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0,5.0,4.0,8.474350e+08,474.0,pixar,1.137207e+09,(1995),Toy Story
...,...,...,...,...,...,...,...,...,...,...,...,...,...
285745,187595,Solo: A Star Wars Story (2018),Action|Adventure|Children|Sci-Fi,3778644,348350.0,586.0,5.0,1.529900e+09,62.0,star wars,1.528935e+09,(2018),Solo: A Star Wars Story
285770,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,anime,1.537099e+09,(2010),Gintama: The Movie
285771,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,comedy,1.537099e+09,(2010),Gintama: The Movie
285772,193565,Gintama: The Movie (2010),Action|Animation|Comedy|Sci-Fi,1636780,71172.0,184.0,3.5,1.537099e+09,184.0,gintama,1.537099e+09,(2010),Gintama: The Movie
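
Fixed-width slicing keeps the parentheses in `year` and assumes every title ends in exactly `(YYYY)`. A regex anchored at the end of the string is one possible alternative: it tolerates trailing whitespace and leaves alternate titles in parentheses untouched. The sketch below is illustrative only; the last two titles are hypothetical edge cases, and `year` / `clean_title` are stand-in names.

```python
import pandas as pd

titles = pd.Series([
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Léon: The Professional (a.k.a. The Professional) (Léon) (1994)",
    "Movie With Trailing Space (2011) ",  # hypothetical edge case
    "Untitled Without Year",              # hypothetical edge case
])

# Capture the four digits of the final "(YYYY)"; rows without one become NaN.
year = pd.to_numeric(titles.str.extract(r"\((\d{4})\)\s*$", expand=False))

# Remove only that trailing "(YYYY)", keeping alternate titles in parentheses.
clean_title = titles.str.replace(r"\s*\(\d{4}\)\s*$", "", regex=True)

print(pd.DataFrame({"clean_title": clean_title, "year": year}))
```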


# Ignore All below: 

In [116]:
title = full_df['title'].str.split('(', expand=True)
title

Unnamed: 0,0,1,2,3
0,Toy Story,1995),,
1,Toy Story,1995),,
2,Toy Story,1995),,
3,Toy Story,1995),,
4,Toy Story,1995),,
...,...,...,...,...
285745,Solo: A Star Wars Story,2018),,
285770,Gintama: The Movie,2010),,
285771,Gintama: The Movie,2010),,
285772,Gintama: The Movie,2010),,


In [120]:
title[3].value_counts()

1994)    4655
2002)     185
1992)      14
2005)       5
1945)       2
1998)       1
Name: 3, dtype: int64

In [125]:
title.loc[title[3] == '1994)']


Unnamed: 0,0,1,2,3
23286,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23287,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23288,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23289,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23290,Léon: The Professional,a.k.a. The Professional),Léon),1994)
...,...,...,...,...
27936,Léon: The Professional,a.k.a. The Professional),Léon),1994)
27937,Léon: The Professional,a.k.a. The Professional),Léon),1994)
27938,Léon: The Professional,a.k.a. The Professional),Léon),1994)
27939,Léon: The Professional,a.k.a. The Professional),Léon),1994)


In [137]:
title.loc[title[3].notnull()]

Unnamed: 0,0,1,2,3
23286,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23287,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23288,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23289,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23290,Léon: The Professional,a.k.a. The Professional),Léon),1994)
...,...,...,...,...
242318,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)
242319,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)
242320,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)
242321,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)


In [159]:
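# Despite the name, these are the rows where column 2 is NOT null,
# i.e. titles with at least one extra parenthesised part before the year
nulls = title.loc[title[2].notnull()]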

In [160]:
nulls

Unnamed: 0,0,1,2,3
2383,"City of Lost Children, The","Cité des enfants perdus, La)",1995),
2384,"City of Lost Children, The","Cité des enfants perdus, La)",1995),
2385,"City of Lost Children, The","Cité des enfants perdus, La)",1995),
2386,"City of Lost Children, The","Cité des enfants perdus, La)",1995),
2387,"City of Lost Children, The","Cité des enfants perdus, La)",1995),
...,...,...,...,...
279973,Angel's Egg,Tenshi no tamago),1985),
279974,Angel's Egg,Tenshi no tamago),1985),
279975,Angel's Egg,Tenshi no tamago),1985),
280905,The Rabbi's Cat,Le chat du rabbin),2011),


In [155]:
nulls2 = nulls[0] + "(" + nulls[1]

In [156]:
pd.DataFrame(nulls2)

Unnamed: 0,0
2383,"City of Lost Children, The (Cité des enfants p..."
2384,"City of Lost Children, The (Cité des enfants p..."
2385,"City of Lost Children, The (Cité des enfants p..."
2386,"City of Lost Children, The (Cité des enfants p..."
2387,"City of Lost Children, The (Cité des enfants p..."
...,...
279973,Angel's Egg (Tenshi no tamago)
279974,Angel's Egg (Tenshi no tamago)
279975,Angel's Egg (Tenshi no tamago)
280905,The Rabbi's Cat (Le chat du rabbin)


In [167]:
nulls3 = title.loc[title[3].notnull()]
nulls3

Unnamed: 0,0,1,2,3
23286,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23287,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23288,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23289,Léon: The Professional,a.k.a. The Professional),Léon),1994)
23290,Léon: The Professional,a.k.a. The Professional),Léon),1994)
...,...,...,...,...
242318,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)
242319,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)
242320,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)
242321,Lady Vengeance,Sympathy for Lady Vengeance),Chinjeolhan geumjassi),2005)


In [187]:
# Rebuild the full title for rows that contain two alternate titles
nulls4 = nulls3[0] + "(" + nulls3[1] + "(" + nulls3[2]
New_df = pd.DataFrame(nulls4)

# Keep the year fragment (column 3) alongside the rebuilt title
New_df['Years'] = nulls3[3]
New_df

Unnamed: 0,0,Years
23286,Léon: The Professional (a.k.a. The Professiona...,1994)
23287,Léon: The Professional (a.k.a. The Professiona...,1994)
23288,Léon: The Professional (a.k.a. The Professiona...,1994)
23289,Léon: The Professional (a.k.a. The Professiona...,1994)
23290,Léon: The Professional (a.k.a. The Professiona...,1994)
...,...,...
242318,Lady Vengeance (Sympathy for Lady Vengeance) (...,2005)
242319,Lady Vengeance (Sympathy for Lady Vengeance) (...,2005)
242320,Lady Vengeance (Sympathy for Lady Vengeance) (...,2005)
242321,Lady Vengeance (Sympathy for Lady Vengeance) (...,2005)


In [172]:
dropped_df = title.loc[title[2].isnull()]

In [188]:
dropped_df

Unnamed: 0,0,1,2,3
0,Toy Story,1995),,
1,Toy Story,1995),,
2,Toy Story,1995),,
3,Toy Story,1995),,
4,Toy Story,1995),,
...,...,...,...,...
285745,Solo: A Star Wars Story,2018),,
285770,Gintama: The Movie,2010),,
285771,Gintama: The Movie,2010),,
285772,Gintama: The Movie,2010),,


In [189]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([dropped_df, New_df], ignore_index = True)

Unnamed: 0,0,1,2,3,Years
0,Toy Story,1995),,,
1,Toy Story,1995),,,
2,Toy Story,1995),,,
3,Toy Story,1995),,,
4,Toy Story,1995),,,
...,...,...,...,...,...
222517,Lady Vengeance (Sympathy for Lady Vengeance) (...,,,,2005)
222518,Lady Vengeance (Sympathy for Lady Vengeance) (...,,,,2005)
222519,Lady Vengeance (Sympathy for Lady Vengeance) (...,,,,2005)
222520,Lady Vengeance (Sympathy for Lady Vengeance) (...,,,,2005)
