### Final Project: Sprint 2
Rachel Cleal, DS4003

In [2]:
# Import dependencies 
import pandas as pd     # pandas
import seaborn as sns   # seaborn
import plotly.express as px     # plotly

### About The Data
The dataset I am using was scraped from IMDb and contains 9,083 movies. It was published by Elvin Rustamov and can be accessed on Kaggle (https://www.kaggle.com/datasets/elvinrustam/imdb-movies-dataset/data).  

I chose this dataset because it contains several variables, both categorical and numerical, that I believe will translate into compelling, interactive data visualizations and an engaging app. Additionally, I am a Media Studies major and am interested in comparing various aspects of films, including directors, writers, and success at the box office, to name a few.  

In [3]:
# Read in data
df = pd.read_csv(r"Final_Project\IMDbMovies-Clean.csv")  # imdb movies dataset; raw string so the backslash is not read as an escape

### Data Cleaning

In [4]:
df.info()   # Lists column names with the non-null count and datatype for each column

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9083 entries, 0 to 9082
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Title                                9083 non-null   object 
 1   Summary                              9083 non-null   object 
 2   Director                             9052 non-null   object 
 3   Writer                               8759 non-null   object 
 4   Main Genres                          9076 non-null   object 
 5   Motion Picture Rating                8285 non-null   object 
 6   Release Year                         9076 non-null   float64
 7   Runtime (Minutes)                    8918 non-null   float64
 8   Rating (Out of 10)                   8813 non-null   float64
 9   Number of Ratings (in thousands)     8813 non-null   float64
 10  Budget (in millions)                 5879 non-null   float64
 11  Gross in US & Canada (in milli

In [5]:
# Change datatype of columns Title and Summary to string
# Change datatype of columns Director, Writer, Main Genres, and Motion Picture Rating to categorical 
df['Title'] = df['Title'].astype('string')
df['Summary'] = df['Summary'].astype('string')
df['Director'] = df['Director'].astype('category')
df['Writer'] = df['Writer'].astype('category')
df['Main Genres'] = df['Main Genres'].astype('category')
df['Motion Picture Rating'] = df['Motion Picture Rating'].astype('category')

In [6]:
# Convert column Opening Weekend in US & Canada to datetime
df['Opening Weekend in US & Canada'] = pd.to_datetime(df['Opening Weekend in US & Canada'])
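
By default, `pd.to_datetime` raises an error if it meets a value it cannot parse. The sketch below uses made-up sample strings, not values from the dataset, to show how the `errors="coerce"` option would instead turn such entries into `NaT`, so they can be counted like any other missing value:

```python
import pandas as pd

# Hypothetical sample values; the real column's date format may differ.
sample = pd.Series(["2010-07-16", "not a date", None])

# errors="coerce" turns anything unparseable into NaT instead of raising.
parsed = pd.to_datetime(sample, errors="coerce")

print(parsed.isna().sum())   # number of entries that could not be parsed
print(parsed.iloc[0].year)   # year of the first, successfully parsed entry
```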

In [7]:
# Check data for null/missing values
df.isna().sum()     # Checks for missing values then returns the sum for each column

Title                                     0
Summary                                   0
Director                                 31
Writer                                  324
Main Genres                               7
Motion Picture Rating                   798
Release Year                              7
Runtime (Minutes)                       165
Rating (Out of 10)                      270
Number of Ratings (in thousands)        270
Budget (in millions)                   3204
Gross in US & Canada (in millions)     3019
Gross worldwide (in millions)          1955
Opening Weekend in US & Canada         3388
Gross Opening Weekend (in millions)    3388
dtype: int64

In [8]:
# Drop all rows that contain NaN values and reset index
df.dropna(axis=0, inplace=True)
df.reset_index(drop=True, inplace=True)

### Exploratory Analysis of Data

In [9]:
# Lists columns names and the non-null counts and datatype for each column
df.info()    

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4514 entries, 0 to 4513
Data columns (total 15 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   Title                                4514 non-null   string        
 1   Summary                              4514 non-null   string        
 2   Director                             4514 non-null   category      
 3   Writer                               4514 non-null   category      
 4   Main Genres                          4514 non-null   category      
 5   Motion Picture Rating                4514 non-null   category      
 6   Release Year                         4514 non-null   float64       
 7   Runtime (Minutes)                    4514 non-null   float64       
 8   Rating (Out of 10)                   4514 non-null   float64       
 9   Number of Ratings (in thousands)     4514 non-null   float64       
 10  Budget (in m

In [10]:
df.isna().sum()  # Success

Title                                  0
Summary                                0
Director                               0
Writer                                 0
Main Genres                            0
Motion Picture Rating                  0
Release Year                           0
Runtime (Minutes)                      0
Rating (Out of 10)                     0
Number of Ratings (in thousands)       0
Budget (in millions)                   0
Gross in US & Canada (in millions)     0
Gross worldwide (in millions)          0
Opening Weekend in US & Canada         0
Gross Opening Weekend (in millions)    0
dtype: int64

In [11]:
# Provides the number of rows in the data
# In this case, this is the number of movies left in the dataset after dropping rows with null values
df.shape[0]

4514
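
Dropping every row with any missing value reduces the dataset from 9,083 to 4,514 movies, roughly half. The sketch below uses a small made-up DataFrame rather than the IMDb data. It shows one way to see which columns drive that loss, and how many rows would remain if only selected columns had to be complete. The choice of `required` columns is purely illustrative and is not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the movie data, with gaps in two columns.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating (Out of 10)": [7.1, np.nan, 6.4, 8.0],
    "Budget (in millions)": [10.0, 25.0, np.nan, np.nan],
})

# Share of missing values per column (0.0 = complete, 1.0 = entirely missing).
missing_share = toy.isna().mean()
print(missing_share)

# Rows kept when every column must be complete.
rows_all = len(toy.dropna())

# Rows kept when only an illustrative subset of columns must be complete.
required = ["Rating (Out of 10)"]
rows_subset = len(toy.dropna(subset=required))

print(rows_all, rows_subset)
```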

In [12]:
# Check value counts for categorical column Director
df['Director'].value_counts()

Director
Steven Spielberg                                                                                                                                                 32
Ridley Scott                                                                                                                                                     26
Clint Eastwood                                                                                                                                                   24
Martin Scorsese                                                                                                                                                  21
Ron Howard                                                                                                                                                       20
                                                                                                                                                                 ..
Jason A

In [13]:
# Value counts of column Writer
df['Writer'].value_counts()

Writer
John Hughes                                   14
Woody Allen                                   14
Kevin Smith                                   11
M. Night Shyamalan                             9
Steve Kloves,J.K. Rowling                      8
                                              ..
Jesús Franco,Christine Lembach,Connie Grau     0
Jessika Jankert                                0
Jessie Nelson,Karen Leigh Hopkins              0
Jessica Parker,Paul Lalonde,John Patus         0
Éric Besnard,Nicolas Boukhrief                 0
Name: count, Length: 7853, dtype: int64

In [14]:
# Value counts of column Main Genres
df['Main Genres'].value_counts()

Main Genres
Comedy,Drama,Romance          182
Animation,Adventure,Comedy    148
Comedy                        141
Action,Crime,Drama            134
Action,Adventure,Sci-Fi       125
                             ... 
Animation,Family                0
Animation,Drama,Romance         0
Animation,Drama,Music           0
Animation,Drama,Horror          0
Western                         0
Name: count, Length: 460, dtype: int64

In [15]:
# Value counts of column Motion Picture Rating
df['Motion Picture Rating'].value_counts()

Motion Picture Rating
R            2016
PG-13        1497
PG            738
G             109
Not Rated      99
NC-17          13
Unrated        12
Passed          8
TV-MA           7
Approved        6
13+             2
16+             1
M/PG            1
TV-14           1
GP              1
TV-PG           1
18+             1
X               1
MA-17           0
M               0
T               0
TV-G            0
TV-Y            0
TV-Y7           0
TV-Y7-FV        0
Name: count, dtype: int64
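
The zero counts in the `value_counts()` outputs above appear because a categorical column keeps every category it was created with, even after `dropna` has removed all the rows that used it. The sketch below uses a made-up DataFrame to show the effect, and how `cat.remove_unused_categories()` would clear those entries if that is wanted before plotting:

```python
import pandas as pd

# Made-up data: the only "TV-Y" row also has a missing budget.
toy = pd.DataFrame({
    "Motion Picture Rating": pd.Series(["R", "PG", "TV-Y", "R"], dtype="category"),
    "Budget (in millions)": [10.0, 5.0, None, 20.0],
})

# Same cleaning step as in the notebook: drop incomplete rows.
toy = toy.dropna().reset_index(drop=True)

# "TV-Y" is still a known category, so it is listed with a count of 0.
before = toy["Motion Picture Rating"].value_counts()
print(before)

# Drop categories that no longer occur in the data.
toy["Motion Picture Rating"] = toy["Motion Picture Rating"].cat.remove_unused_categories()
after = toy["Motion Picture Rating"].value_counts()
print(after)
```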

In [16]:
# Analyzing distribution of Release Year column and looking for outliers
fig = px.box(df, y='Release Year')
fig.show()

In [17]:
# Analyzing distribution of Runtime column and looking for outliers
fig = px.box(df, y='Runtime (Minutes)')
fig.show()

In [18]:
# Analyzing distribution of Rating column and looking for outliers 
fig = px.box(df, y='Rating (Out of 10)')
fig.show()
# Looks good, all between 0 and 10

In [19]:
# Analyzing distribution of Number of Ratings column and looking for outliers
fig = px.box(df, y='Number of Ratings (in thousands)')
fig.show()
# Looks like a majority of movies have around 100 thousand ratings (the column is in thousands)

In [20]:
# Analyzing distribution of Budget column and looking for outliers
fig = px.box(df, y='Budget (in millions)')
fig.show()
# Looks like the majority of movies have a lower budget, with larger blockbuster movies as the exception/outliers 

In [21]:
# Analyzing distribution of Gross in US & Canada column and looking for outliers
fig = px.box(df, y='Gross in US & Canada (in millions)')
fig.show()

In [22]:
# Analyzing distribution of Worldwide Gross column and looking for outliers 
fig = px.box(df, y='Gross worldwide (in millions)')
fig.show()
# Outliers in this case are likely blockbuster movies, but more analysis is needed

In [23]:
# Analyzing distribution of Opening Weekend Gross column and looking for outliers 
fig = px.box(df, y='Gross Opening Weekend (in millions)')
fig.show()
# Again, the outliers are likely blockbusters

# Keeping outliers in all of the above cases because they represent individual movies and I do not foresee them skewing future calculations
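
To put a number on what the box plots show, the points beyond the whiskers can be counted with the usual 1.5 × IQR rule. This is a minimal sketch on made-up budget values. The helper `count_iqr_outliers` is hypothetical, and pandas' default quantile interpolation may place the fences slightly differently from Plotly's box plot.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Made-up budgets (in millions) with one blockbuster-sized value.
budgets = pd.Series([5.0, 8.0, 10.0, 12.0, 15.0, 20.0, 250.0])
n_outliers = count_iqr_outliers(budgets)
print(n_outliers)
```

On the real data, the same helper could be applied to each numeric column, for example `count_iqr_outliers(df['Budget (in millions)'])`.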

### Data Dictionary
| Variable | Data Type | Definition |
|----------|------------|------------|
| Title | String | The name of the movie |
| Summary | String | A brief overview of the movie's plot |
| Director | Category | The name of the director(s) of the film, the person in charge of overseeing creative aspects of the film |
| Writer | Category | The writer of the film, the person in charge of creating the screenplay |
| Main Genres | Category | The primary genres the movie falls under, including Drama, Comedy, Romance, Adventure, etc. |
| Motion Picture Rating | Category | The age-appropriateness classification for viewers, including G (General Audiences), PG (Parental Guidance Suggested), PG-13 (Parents Strongly Cautioned for children under 13), R (Restricted: viewers under 17 require an accompanying parent or adult guardian), and NC-17 (no one 17 and under admitted), among others. |
| Runtime | Float | The total duration of the movie, in minutes |
| Release Year | Float | The year in which the movie was officially released |
| Rating | Float | The average score given to the movie by viewers out of 10 |
| Number of Ratings | Float | The total count of ratings submitted by viewers in thousands |
| Budget | Float | The estimated cost of producing the movie in millions |
| Gross in US & Canada | Float | The total earnings from the movie's screenings in the United States and Canada in millions |
| Gross Worldwide | Float | The overall worldwide earnings of the movie in millions |
| Opening Weekend in US & Canada | Datetime64 | The date of the movie's opening weekend in the United States and Canada |
| Gross Opening Weekend | Float | The amount generated from screenings of the movie in the initial weekend of the movie's release in millions. |


Deleted variables: None at this point in time, only deleted rows with null values 

### Possible UI Components:
- Slider for Release Year which connects to a visualization
- Dropdown menu to select Main Genre which connects to a visualization
- Search bar to allow user to type in movie Title and be provided with information about that movie
- Search bar which connects to visualization, user can type in Director/Writer name and get information about his or her movies
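
The filtering behind a Release Year slider and a Main Genre dropdown could be sketched in plain pandas as below. The function name `filter_movies` and the toy data are hypothetical. Because `Main Genres` stores comma-separated values, the sketch splits each entry before checking for the selected genre.

```python
import pandas as pd

def filter_movies(movies, year_range, genre=None):
    """Return the rows within year_range (inclusive) that list the given genre."""
    low, high = year_range
    mask = movies["Release Year"].between(low, high)
    if genre is not None:
        # Split "Comedy,Drama" into ["Comedy", "Drama"] and test membership.
        genre_lists = movies["Main Genres"].astype(str).str.split(",")
        mask = mask & genre_lists.apply(lambda genres: genre in genres)
    return movies[mask]

# Made-up rows standing in for the cleaned dataset.
toy = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "Release Year": [1995.0, 2005.0, 2015.0],
    "Main Genres": ["Comedy,Drama", "Action", "Drama"],
})

result = filter_movies(toy, (2000, 2020), genre="Drama")
print(result["Title"].tolist())
```

In an app, the slider and dropdown values would be passed in as `year_range` and `genre`, and the returned rows would feed the visualization.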

### Possible Data Visualizations:
- Bar chart of number of movies vs. Main Genre
    - Could include dropdown for users to select Genres to compare
- Bar chart of average Gross by each Director
    - Can be interactive with user typing in names to compare
- Scatterplot of Budget vs. Gross or Rating
    - Could include sliders for Release Year 
    - Could also include dropdown to select Director or Main Genre 
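
The aggregations behind the first two charts might look like the sketch below, which uses made-up rows and assumes worldwide gross as the measure of "Gross". Splitting and exploding `Main Genres` counts each movie once per genre it lists, and `observed=True` keeps unused categories out of the result when `Director` is categorical.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataset.
toy = pd.DataFrame({
    "Director": ["Director A", "Director A", "Director B"],
    "Main Genres": ["Comedy,Drama", "Drama", "Action,Drama"],
    "Gross worldwide (in millions)": [100.0, 300.0, 50.0],
})

# Number of movies per individual genre (one row per movie-genre pair).
genre_counts = (
    toy["Main Genres"].astype(str).str.split(",").explode().value_counts()
)
print(genre_counts)

# Average worldwide gross per director, highest first.
avg_gross = (
    toy.groupby("Director", observed=True)["Gross worldwide (in millions)"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_gross)
```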

In [32]:
# Saving data to local files
df.to_csv(r'C:\Users\rrcle\Downloads\DS4003\Final_Project\data.csv', index=False)  # index=False avoids writing the row index as an extra column