<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Capstone Project: Film Linguistics
## Notebook 1
#### Stephen Strawbridge, Cohort #1019

---
# Section 1 - Background and Problem Statement

---

## Problem Statement

I hypothesize that movies in certain genres fall short of their full rating potential in part because linguistic statistics are not considered when their scripts are written. This project uses several prediction models to identify, for each genre, the linguistic features associated with successful movies.

---
# Section 2 - Data Cleaning

---

In [1]:
#Import necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
#Read in dataset
df = pd.read_csv('./Datasets/dataset.csv')


In [3]:
#Look at overall info on dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30493 entries, 0 to 30492
Columns: 224 entries, MovieID to Filler-ratio
dtypes: float64(6), int64(120), object(98)
memory usage: 52.1+ MB


In [4]:
#Isolate for columns with null values
nan_cols = [i for i in df if df[i].isnull().any()]
df[nan_cols].isnull().sum()

random_number                 58
plot_summary               28887
made_for                   27198
suspended                  30479
running_time                1877
running_time_comment       27406
country                        7
USAonly_1_other_0              7
rating_dist                  568
rating_votes                 568
rating_rank                  568
CERT_dummycode             16972
cert-west-germany          26482
genre1                       257
genre2                      9686
genre3                     19502
PrimaryGenre_dummycoded      257
dtype: int64
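
The raw null counts above are easier to act on when read as a share of all rows, since a column that is almost entirely null calls for different handling than one missing a handful of values. Below is a minimal sketch of that idea on a small made-up frame; the `null_summary` helper and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def null_summary(frame):
    """Return null counts and null shares for columns with at least one null."""
    counts = frame.isnull().sum()
    counts = counts[counts > 0]
    return pd.DataFrame({
        'null_count': counts,
        'null_share': counts / len(frame),
    }).sort_values('null_share', ascending=False)

# Toy frame standing in for the movie dataset (values are made up)
toy = pd.DataFrame({
    'MovieID': [1, 2, 3, 4],
    'suspended': [np.nan, np.nan, np.nan, 'yes'],
    'country': ['USA', np.nan, 'UK', 'USA'],
})

summary = null_summary(toy)
print(summary)
```

Calling `null_summary(df)` on the real dataframe would give the same counts as the cell above, with the share of missing rows alongside each one.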

In [5]:
#First, drop columns that are unnecessary for this project
df = df.drop(columns=['random_number', 'USAonly_1_other_0', 'CERT_dummycode', 'cert-west-germany'])

In [6]:
#Replace nulls in the suspended column with 'Not suspended'
df['suspended'] = df['suspended'].replace(np.nan, 'Not suspended')

In [7]:
#Map the phrase 'Not provided' to object columns with null values
obj_null_cols = ['plot_summary', 'made_for', 'running_time', 'running_time_comment', 'country']

for col in obj_null_cols:
    df[col] = df[col].replace(np.nan, 'Not provided')

In [8]:
#Ratings will be a primary target variable in this project, so rows with missing rating data are dropped
df = df.dropna(subset=['rating_dist', 'rating_votes', 'rating_rank'])

In [9]:
#For genres, our dataframe already has the genres dummified, so we can drop the original genre columns
df = df.drop(columns=['genre1', 'genre2', 'genre3', 'PrimaryGenre_dummycoded'])

In [10]:
#Double check that no more nulls exist in dataframe
nan_cols = [i for i in df if df[i].isnull().any()]
df[nan_cols].isnull().sum()

Series([], dtype: float64)

In [11]:
#Save cleaned dataframe to Excel
#df.to_excel('./Excels/cleaned_df.xlsx')

---
# Section 3 - Exploratory Data Analysis (EDA)

---

#### Because we are specifically looking at the linguistic characteristic ratios, we will create a list of all ratio feature columns. The initial number of ratio features is 86.

In [14]:
#Create ratio_cols list of all ratio features
ratio_cols = [col for col in df.columns if 'ratio' in col]

#Some ratio columns contain a '?' placeholder instead of a number
#Drop every row that has a '?' in any ratio column
for col in ratio_cols:
    df = df[df[col] != '?']
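
The `'?'` filter above handles one known placeholder. A more general check is to coerce the columns with `pd.to_numeric(..., errors='coerce')` and look at which entries fail to parse. The sketch below shows this on a small made-up frame; the toy values (including the `'n/a'` entry) are assumptions for illustration and do not come from the project data.

```python
import pandas as pd

# Toy frame standing in for the ratio columns (values are made up)
toy = pd.DataFrame({
    'Swear-ratio': ['0.004', '?', '0.010', '0.002'],
    'Filler-ratio': ['0.001', '0.000', 'n/a', '0.003'],
})
ratio_cols = [col for col in toy.columns if 'ratio' in col]

# Coerce anything non-numeric to NaN instead of raising an error
numeric = toy[ratio_cols].apply(pd.to_numeric, errors='coerce')

# Rows where coercion produced a NaN that was not null to begin with
bad_rows = (numeric.isna() & toy[ratio_cols].notna()).any(axis=1)

print(toy[bad_rows])          # inspect the offending rows
clean = numeric[~bad_rows]    # numeric ratios with the bad rows removed
```

Inspecting `toy[bad_rows]` before dropping anything shows which placeholders are actually present, so none are missed by a filter written for `'?'` alone.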

In [43]:
#Convert columns to floats (they are currently object types)
df[ratio_cols] = df[ratio_cols].astype(float)

#Create df for all ratio features and explore summary stats
ratio_df = df[ratio_cols]
ratio_df.describe()

| | HarmVirtue-ratio | HarmVice-ratio | FairnessVirtue-ratio | FairnessVice-ratio | IngroupVirtue-ratio | IngroupVice-ratio | AuthorityVirtue-ratio | AuthorityVice-ratio | PurityVirtue-ratio | PurityVice-ratio | ... | Home-ratio | Money-ratio | Relig-ratio | Death-ratio | Informal-ratio | Swear-ratio | Netspeak-ratio | Assent-ratio | Nonflu-ratio | Filler-ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | ... | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 | 29921.0 |
| mean | 0.001432 | 0.002836 | 0.000365 | 4.7e-05 | 0.001344 | 0.000407 | 0.002984 | 0.000265 | 0.000676 | 0.000803 | ... | 0.004338 | 0.00597 | 0.004215 | 0.004055 | 0.024876 | 0.00466 | 0.004021 | 0.008119 | 0.00683 | 0.000427 |
| std | 0.001218 | 0.002694 | 0.000462 | 0.000162 | 0.001425 | 0.000781 | 0.002588 | 0.000596 | 0.001089 | 0.000948 | ... | 0.002855 | 0.004823 | 0.005032 | 0.003883 | 0.019023 | 0.006367 | 0.008088 | 0.005485 | 0.007126 | 0.002845 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.000713 | 0.001197 | 0.0 | 0.0 | 0.000607 | 0.0 | 0.001293 | 0.0 | 0.000184 | 0.000257 | ... | 0.002604 | 0.002942 | 0.001496 | 0.00143 | 0.012341 | 0.000691 | 0.000583 | 0.004481 | 0.002181 | 0.0 |
| 50% | 0.00119 | 0.002127 | 0.00026 | 0.0 | 0.001059 | 0.000174 | 0.002304 | 0.000105 | 0.000449 | 0.000586 | ... | 0.003887 | 0.004829 | 0.002901 | 0.003022 | 0.020579 | 0.002394 | 0.002108 | 0.007233 | 0.004543 | 0.00025 |
| 75% | 0.001855 | 0.003685 | 0.000521 | 0.0 | 0.001696 | 0.000495 | 0.003919 | 0.000325 | 0.000843 | 0.00106 | ... | 0.005499 | 0.00772 | 0.005061 | 0.005577 | 0.033192 | 0.006082 | 0.00551 | 0.010742 | 0.009317 | 0.00056 |
| max | 0.041667 | 0.102041 | 0.01087 | 0.008637 | 0.056075 | 0.029412 | 0.033557 | 0.035714 | 0.057292 | 0.041667 | ... | 0.131579 | 0.126824 | 0.104839 | 0.12 | 0.574468 | 0.076547 | 0.5 | 0.137255 | 0.22069 | 0.48 |


In [46]:
#Compare above statistics to the total word count summary statistics
df[['TotalWords.1']].describe()

| | TotalWords.1 |
|---|---|
| count | 29921.0 |
| mean | 6118.685639 |
| std | 3182.313585 |
| min | 1.0 |
| 25% | 3972.0 |
| 50% | 5812.0 |
| 75% | 7963.0 |
| max | 60618.0 |
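
One possible reading of the two summaries together is that the largest ratios come from very short scripts, since a ratio computed over only a few words can take extreme values. The sketch below shows one way to check this on a small made-up frame; the `min_words` cutoff of 1000 and the toy values are hypothetical and would need to be chosen from the real data.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset (values are made up)
toy = pd.DataFrame({
    'TotalWords.1': [2, 50, 4000, 6000],
    'Netspeak-ratio': [0.5, 0.1, 0.004, 0.002],
})

# Hypothetical cutoff; the right value would need to be chosen from the data
min_words = 1000

short_scripts = toy[toy['TotalWords.1'] < min_words]
long_scripts = toy[toy['TotalWords.1'] >= min_words]

# Compare the largest ratio in each group
print(short_scripts['Netspeak-ratio'].max())
print(long_scripts['Netspeak-ratio'].max())
```

If the same split on the real dataframe shows that the extreme maxima sit in the short-script group, that would be a reason to consider a minimum word count before modeling.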
