## Introduction 

Hello! This is my first data science project conducted outside of any academic or professional environment and published for public view. I had originally considered a finance-based dataset, but stumbled upon this (at the time) fresh dataset on Kaggle, which contains both qualitative and quantitative data and, more importantly, centres on a field in which I have a personal interest: films.

The goal of this notebook, from a personal and professional development standpoint, is to demonstrate the application of data science concepts that I have been learning, primarily through my postgraduate program at the University of the West Indies, Mona, and secondarily through independent study using online resources. In terms of business applications, this project aims to classify films based on variables such as runtime and genre, in order to recommend to cinema operations staff the films that may generate the greatest revenue.

This dataset may be downloaded at: https://www.kaggle.com/preetviradiya/imdb-movies-ratings-details.

## Data Dictionary

| Field | Details |
|  ---  |   ---   |
| name  | The title of the film |
| year  | The year the film was released |
| runtime | The duration of the film in minutes |
| genre | The genre of the film |
| rating | The score of the film from 0 - 10 based on votes by IMDb users |
| metascore | The score of the film from 0 - 100 based on data from metacritic.com |
| timeline | A short summary of the events of the film |
| votes | The number of IMDb users that have rated the film |
| gross | The box office revenue for that film |


In [1]:
import re
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans, MeanShift, DBSCAN, Birch
from sklearn import metrics

In [2]:
df = pd.read_csv('IMDB_movie_reviews_details.csv')

## Data Preparation

In [3]:
# View a list of the fields/columns in the dataset
print('The field names are', ", ".join([str(i) for i in df.columns.to_list()]), '.')
print('The size of the dataframe is', len(df), 'records.')

The field names are Unnamed: 0, name, year, runtime, genre, rating, metascore, timeline, votes, gross .
The size of the dataframe is 1000 records.


In [4]:
# The dataset has relatively few columns, so we can call the head method to view the top 5 records
df.head()

Unnamed: 0.1,Unnamed: 0,name,year,runtime,genre,rating,metascore,timeline,votes,gross
0,0,The Shawshank Redemption,1994,142,Drama,9.3,80.0,Two imprisoned men bond over a number of years...,2394059,$28.34M
1,1,The Godfather,1972,175,"Crime, Drama",9.2,100.0,An organized crime dynasty's aging patriarch t...,1658439,$134.97M
2,2,Soorarai Pottru,2020,153,Drama,9.1,,"Nedumaaran Rajangam ""Maara"" sets out to make t...",78266,
3,3,The Dark Knight,2008,152,"Action, Crime, Drama",9.0,84.0,When the menace known as the Joker wreaks havo...,2355907,$534.86M
4,4,The Godfather: Part II,1974,202,"Crime, Drama",9.0,90.0,The early life and career of Vito Corleone in ...,1152912,$57.30M


The Unnamed: 0 field corresponds to the ID number of the film in the original dataset and is equivalent to the index of the film in the dataframe. This field is therefore redundant, so we can go ahead and drop it.

In [5]:
df.drop(columns='Unnamed: 0', inplace=True)
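
An `Unnamed: 0` column usually appears when a CSV was saved together with its index. As a minimal sketch, assuming that is the case here, the column could instead be absorbed at load time with the `index_col` parameter of `pd.read_csv`. The two-row CSV below is a hypothetical sample, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample mimicking a CSV that was saved with its index
sample_csv = io.StringIO(
    ",name,year\n"
    "0,The Shawshank Redemption,1994\n"
    "1,The Godfather,1972\n"
)

# index_col=0 tells pandas to use the first column as the index,
# so no 'Unnamed: 0' column is created
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.to_list())
```

Dropping the column after loading, as done above, gives the same result; the choice is mostly a matter of preference.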

Let us take a look at the data types.

In [6]:
df.dtypes

name          object
year          object
runtime        int64
genre         object
rating       float64
metascore    float64
timeline      object
votes         object
gross         object
dtype: object

We need to convert year, votes, and gross to numeric data types for further analysis. Year must be a 4 digit integer. Let us check if there are any values for year that do not fit this constraint.

In [7]:
df.loc[df.year.str.len() != 4].year.value_counts()

I 2015      4
I 2004      3
I 2017      3
I 2014      3
II 2015     2
II 2016     2
I 2011      2
I 2010      2
I 2007      2
I 2013      2
III 2018    1
I 2016      1
I 2008      1
I 1985      1
I 2001      1
III 2016    1
I 2020      1
I 1995      1
Name: year, dtype: int64

We have a few items that include a letter prefix (I, II, or III) in their year value. Let us proceed by removing all non-digit characters, including the separating space, from values in the year field.

In [8]:
df.year = df.year.str.replace(r'\D+', '', regex=True)
# Let us check to see if this was performed correctly
print('The number of non 4 digit integers is', len(df.loc[df.year.str.len() != 4].year))
# Convert to int
df.year = pd.to_numeric(df.year)

The number of non 4 digit integers is 0
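
As an alternative sketch, the year could be captured directly rather than by stripping everything else. The example below uses `Series.str.extract` with a four-digit pattern on a hypothetical sample of strings in the formats seen above. It assumes each value contains exactly one run of four digits.

```python
import pandas as pd

# Hypothetical sample of raw year strings in the formats seen above
raw_years = pd.Series(['1994', 'I 2015', 'II 2016', 'III 2018'])

# Capture four consecutive digits; rows with no match would become NaN
years = raw_years.str.extract(r'(\d{4})', expand=False)
years = pd.to_numeric(years)
print(years.to_list())
```

Unlike stripping non-digits, this approach leaves a missing value where no year is found, which can make unexpected formats easier to spot.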


Let us repeat this step for the other numerical fields: votes and gross.

In [9]:
df.votes = df.votes.str.replace(r'\D+', '', regex=True)
df.votes = pd.to_numeric(df.votes)
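
Before stripping characters, it can help to see which values actually fail to parse. The sketch below does this with `pd.to_numeric(..., errors='coerce')` on a hypothetical sample; the comma-separated format is an assumption made for illustration and is not confirmed by the output shown above.

```python
import pandas as pd

# Hypothetical sample; the thousands separators are an assumption
raw_votes = pd.Series(['2394059', '1,658,439', '78266'])

# Values that cannot be parsed as numbers become NaN under errors='coerce'
unparseable = pd.to_numeric(raw_votes, errors='coerce').isna()
print(raw_votes[unparseable].to_list())
```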

In [10]:
df.gross

0       $28.34M
1      $134.97M
2           NaN
3      $534.86M
4       $57.30M
         ...   
995         NaN
996         NaN
997     $20.00M
998     $30.50M
999         NaN
Name: gross, Length: 1000, dtype: object

For gross, we need to consider that the values are expressed in millions: the digits that follow the decimal point represent fractions of a million. Accordingly, we can choose either to retain the decimal values and keep the unit in mind when performing statistical analysis, or to convert the values to full amounts now. Let us go with the second option, so that the unit does not need to be remembered later.

In [11]:
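def gross_conversion(value):
    """
    Removes all non-numeric characters from value except '.', then multiplies the resulting value by 
    1000000. Returns NaN if the value cannot be converted (e.g. a missing value).
    """
    try:
        return float(re.sub(r'[^\d.]+', '', str(value)))*1000000
    except ValueError:
        # A missing value becomes the string 'nan', which is reduced to '' and rejected by float()
        return np.nan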

# Apply function to values in df.gross
df['gross'] = df.gross.apply(gross_conversion)
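
The same conversion can also be sketched without a row-wise function, using vectorised string operations. The sample values below are hypothetical and follow the `$28.34M` format seen above; the names `raw_gross` and `gross_full` are introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the format of the gross field, with one missing value
raw_gross = pd.Series(['$28.34M', '$134.97M', np.nan])

# Keep only digits and the decimal point; missing values stay missing
cleaned = raw_gross.str.replace(r'[^\d.]+', '', regex=True)

# Parse to float and scale from millions to full amounts
gross_full = pd.to_numeric(cleaned, errors='coerce') * 1_000_000
print(gross_full)
```

Both versions should give equivalent results; the vectorised form is typically faster on large frames, while the function is easier to extend with extra rules.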

In [12]:
# Let us confirm the new data types
df.dtypes

name          object
year           int64
runtime        int64
genre         object
rating       float64
metascore    float64
timeline      object
votes          int64
gross        float64
dtype: object

In [13]:
df.describe()