# Cleaning the data

In [2]:
import pandas as pd
import numpy as np

Read in ratings data

In [3]:
ratings_messy = pd.read_csv("IMDb ratings.csv")

Drop irrelevant columns

In [4]:
ratings = ratings_messy.drop(['allgenders_0age_votes',
                              'allgenders_18age_votes',
                              'allgenders_30age_votes',
                              'allgenders_45age_votes',
                              'males_allages_votes',
                              'males_0age_votes',
                              'males_18age_votes',
                              'males_30age_votes',
                              'males_45age_votes',
                              'females_allages_votes',
                              'females_0age_votes',
                              'females_18age_votes',
                              'females_30age_votes',
                              'females_45age_votes',
                              'top1000_voters_rating',
                              'top1000_voters_votes',
                              'us_voters_rating',
                              'us_voters_votes',
                              'non_us_voters_rating',
                              'non_us_voters_votes'],
                             axis=1)
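
The same set of columns can also be selected by name pattern instead of being typed out one by one. The sketch below is an illustration on a toy frame, not the notebook's method: `toy`, `pattern`, and `trimmed` are hypothetical names, and the regular expression assumes the naming scheme visible in the list above.

```python
import pandas as pd

# Toy stand-in for the ratings table; only the column names matter here.
toy = pd.DataFrame(columns=[
    "imdb_title_id", "weighted_average_vote", "total_votes",
    "allgenders_0age_votes", "males_allages_votes", "females_45age_votes",
    "top1000_voters_rating", "us_voters_votes", "non_us_voters_rating",
])

# Assumed naming scheme: demographic vote counts and voter-group columns.
pattern = (
    r"^(?:(?:allgenders|males|females)_(?:\d+|all)ages?_votes"
    r"|(?:top1000|us|non_us)_voters_(?:rating|votes))$"
)

# DataFrame.filter(regex=...) keeps the columns whose names match the pattern.
to_drop = toy.filter(regex=pattern).columns

trimmed = toy.drop(columns=to_drop)
print(list(trimmed.columns))
```

Passing `columns=` to `drop` avoids having to specify an axis at all.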


# Integrate genres into the clean dataframe

In [5]:
len(ratings.index)

85855

There are 85855 films in our dataset, each of which has one or more genres associated with it, in the following format:  
  
0                          Romance  
1          Biography, Crime, Drama  
2                            Drama  
3                   Drama, History  
4        Adventure, Drama, Fantasy  
                   ...              
85850                       Comedy  
85851                Comedy, Drama  
85852                        Drama  
85853                Drama, Family  
85854                        Drama  
  
Some films list several genres in a single comma-separated string, so the column needs cleaning. We need to find all the unique genres and create one column per genre, marking whether each film belongs to it with a 1 (True) or 0 (False).

Start by creating a list of the genres for each movie.

In [6]:
genres = pd.read_csv('IMDb movies.csv').genre
ratings['genres'] = genres
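
Assigning the column directly relies on both files listing the films in the same row order. If that is not guaranteed, the genres can be attached by key instead. This is a minimal sketch on toy data, assuming the movies file also carries an `imdb_title_id` column; `ratings_toy`, `movies_toy`, and `merged` are hypothetical names.

```python
import pandas as pd

# Toy ratings rows, using IDs that appear in the ratings table.
ratings_toy = pd.DataFrame({
    "imdb_title_id": ["tt0000009", "tt0000574", "tt0001892"],
    "weighted_average_vote": [5.9, 6.1, 5.8],
})

# Toy movies rows, deliberately in a different order.
movies_toy = pd.DataFrame({
    "imdb_title_id": ["tt0001892", "tt0000009", "tt0000574"],
    "genre": ["Drama", "Romance", "Biography, Crime, Drama"],
})

# Join on the shared ID; validate raises if an ID is duplicated on either side.
merged = ratings_toy.merge(
    movies_toy[["imdb_title_id", "genre"]].rename(columns={"genre": "genres"}),
    on="imdb_title_id",
    how="left",
    validate="one_to_one",
)
print(merged)
```

With a left join, films missing from the movies file would get a missing value in `genres` rather than a neighbouring film's genres.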


The following code iterates through each movie's genre string and splits it into a list of individual genres. Whenever we come across a genre that we haven't seen yet, we add it to the list ```genres_unique```.

In [7]:
genres_unique = []

# Walk through every film's genre string, e.g. "Biography, Crime, Drama"
for film_genres in genres.str.split(", "):
    for genre in film_genres:
        # Record each genre the first time it appears
        if genre not in genres_unique:
            genres_unique.append(genre)
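
The same result can be obtained without an explicit loop. The sketch below, on a toy series built from the sample rows shown earlier, splits each string, flattens the lists with `explode`, and keeps the first occurrence of each genre; `genres_toy` and `unique_toy` are hypothetical names.

```python
import pandas as pd

# First five genre strings from the sample shown earlier.
genres_toy = pd.Series([
    "Romance",
    "Biography, Crime, Drama",
    "Drama",
    "Drama, History",
    "Adventure, Drama, Fantasy",
])

# Split into lists, put one genre per row, then de-duplicate in order of appearance.
unique_toy = genres_toy.str.split(", ").explode().unique().tolist()
print(unique_toy)
```

`unique()` preserves the order in which values first appear, so the list matches what the loop would build.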

Let's check if the list is indeed unique:

In [11]:
# Both counts match when the list contains no duplicates
print(pd.Series(genres_unique).nunique())  # 25
print(len(genres_unique))                  # 25

There are 25 unique genres associated with the films in the dataset. Now we will cycle through each genre, creating a column for each one and assigning 1 if the film belongs to that genre, and 0 if it doesn't.

In [12]:
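for genre in genres_unique:
    # Compare whole genre names; a substring test would also flag "Musical" films as "Music"
    ratings[genre] = ratings['genres'].str.split(', ').apply(lambda names: genre in names).astype(int)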

# Drop the messy genres column
ratings = ratings.drop('genres', axis=1)
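
One pitfall when building indicator columns from genre strings is that a substring test treats "Music" as present in "Musical", and both genres occur in this dataset. The sketch below shows the difference on toy data and uses pandas' `Series.str.get_dummies` as an alternative way to one-hot encode whole genre names; `toy`, `substring_flag`, and `one_hot` are hypothetical names.

```python
import pandas as pd

toy = pd.Series(["Comedy, Musical", "Drama, Music", "Drama"])

# Substring matching: the "Musical" film is wrongly flagged as "Music".
substring_flag = toy.str.contains("Music").astype(int)

# get_dummies splits on the separator and matches whole genre names only.
one_hot = toy.str.get_dummies(sep=", ")

print(substring_flag.tolist())
print(one_hot)
```

The resulting frame could be attached to the ratings table with `pd.concat([...], axis=1)`. Note that `get_dummies` orders its columns alphabetically, not by first appearance.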

The dataset now includes genre data in one-hot encoded form.

In [16]:
ratings.head()

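| Unnamed: 0 | imdb_title_id | weighted_average_vote | total_votes | mean_vote | median_vote | votes_10 | votes_9 | votes_8 | votes_7 | votes_6 | ... | Thriller | Sport | Animation | Musical | Music | Film-Noir | Adult | Documentary | Reality-TV | News |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tt0000009 | 5.9 | 154 | 5.9 | 6.0 | 12 | 4 | 10 | 43 | 28 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | tt0000574 | 6.1 | 589 | 6.3 | 6.0 | 57 | 18 | 58 | 137 | 139 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | tt0001892 | 5.8 | 188 | 6.0 | 6.0 | 6 | 6 | 17 | 44 | 52 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | tt0002101 | 5.2 | 446 | 5.3 | 5.0 | 15 | 8 | 16 | 62 | 98 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | tt0002130 | 7.0 | 2237 | 6.9 | 7.0 | 210 | 225 | 436 | 641 | 344 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |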


Export data:

In [17]:
ratings.to_csv('ratings_clean.csv')
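
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below compares both behaviours on a toy frame, using in-memory buffers instead of files; all names in it are hypothetical.

```python
import io
import pandas as pd

toy = pd.DataFrame({"imdb_title_id": ["tt0000009", "tt0000574"], "Drama": [0, 1]})

# Default export: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

cols_with = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_with)
print(cols_without)
```

Whether to keep the index depends on how the cleaned file is consumed later; if nothing relies on it, `index=False` keeps the export tidy.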