# User Study Notebook 
-----------
# EDA

For this activity, you will explore a dataset using pandas. While you explore, a library called __solas__ will be active and will suggest visualizations to you. __Solas__ tracks which functions you execute and suggests visualizations that plot the data from these functions. To see visualization recommendations, simply execute the name of the pandas dataframe or series you would like to visualize, and __solas__ will replace the default output with visualizations. __Solas__ learns from your analysis history, so as you execute more pandas functions, it will be able to recommend more visualizations.


Our goal is to have you explore the dataset as you normally would in Python using __pandas__, and to see how well __solas__ is able to recommend useful visualizations.

## Introduction
For this activity, we will use a dataset about movie sales over different years, along with some information about each movie.

Imagine you are a machine learning engineer trying to predict how a movie performs, in terms of worldwide gross, from the other attributes available in the dataset. To replicate a real-world analysis task, this dataset has not been thoroughly cleaned.

We will pose some exploratory questions throughout the process to guide you toward a better understanding of the movies dataset. Please feel free to answer them with either figures or text.


In [17]:
import solas
import pandas as pd
from vega_datasets import data

# solas config
solas.config.default_display = "solas"
# solas.logger = True

# data load and setup
# df = pd.read_csv("../data/movies.csv")
df = data.movies()
df['Release_Date'] = pd.to_datetime(df['Release_Date'], infer_datetime_format=True)
df['Title'] = df['Title'].astype(str)

df.history.clear()

True

In [18]:
print(df.shape)

(3201, 16)


## Dataset overview

Let's begin by looking at the summary statistics of our data.

In [19]:
df.describe()


#### Are there any null values in the `Worldwide_Gross` column? Try to clean them before proceeding

In [20]:
df.isna()


In [21]:
df = df[~df.Worldwide_Gross.isna()]
df


In [22]:
# df.dropna() 
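
A minimal sketch of the null check and cleanup asked for above, run on a small made-up dataframe (`toy` and its values are invented for illustration and are not taken from the movies dataset). It counts missing values per column, then drops only the rows whose `Worldwide_Gross` is missing, which should be equivalent to the boolean filter used in the earlier cell.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies data; these values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Worldwide_Gross": [1.0e6, np.nan, 2.5e7, np.nan],
    "IMDB_Rating": [6.1, 7.4, np.nan, 5.0],
})

# Count missing values in every column, then in the target column only.
null_counts = toy.isna().sum()
n_missing_target = int(toy["Worldwide_Gross"].isna().sum())

# Drop only the rows whose target is missing; other columns may still hold NaN.
cleaned = toy.dropna(subset=["Worldwide_Gross"])

print(null_counts)
print(n_missing_target, cleaned.shape)
```

Note that calling `dropna()` without `subset` would drop every row that has a null in any column, which may discard far more data than intended.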

#### What is the distribution of `Worldwide_Gross`?

In [23]:
df.Worldwide_Gross


### Explore how `MPAA_Rating` predicts `Worldwide_Gross`

Let's see how `MPAA_Rating` relates to `Worldwide_Gross`.

#### How many different `MPAA_Rating` values are there? What does the distribution look like?


In [24]:
df["MPAA_Rating"].value_counts()


#### How does the mean of `Worldwide_Gross` differ across `MPAA_Rating` categories?

In [25]:
df["MPAA_Rating"]


In [26]:
df.groupby("MPAA_Rating").agg({"Worldwide_Gross": "mean"})
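
A group mean can be misleading when a category contains only a few movies, so it may help to report the group size and median next to it. The sketch below does this on a made-up dataframe (`toy` and its values are invented for illustration); it is one possible way to summarize the groups, not the only one.

```python
import pandas as pd

# Toy stand-in for the movies data; these values are invented for illustration.
toy = pd.DataFrame({
    "MPAA_Rating": ["PG", "PG", "R", "R", "R", "G"],
    "Worldwide_Gross": [100.0, 300.0, 50.0, 70.0, 90.0, 400.0],
})

# Per-rating group size, mean, and median, sorted by mean (largest first).
summary = (
    toy.groupby("MPAA_Rating")["Worldwide_Gross"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)

print(summary)
```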


### Feature selection: Decide between `IMDB_Rating` and `Rotten_Tomatoes_Rating`

At first glance, `IMDB_Rating` and `Rotten_Tomatoes_Rating` are similar features, so we may only need one of them as a predictor variable. Which one is the better predictor? Let's explore the following:

+ The general distribution of the features
+ The number of non-null datapoints available
+ The correlation between each feature and the target variable (a rough measure of predictability)

In [27]:
df[["IMDB_Rating", "Rotten_Tomatoes_Rating"]]


In [28]:
df[["IMDB_Rating", "Rotten_Tomatoes_Rating", "Worldwide_Gross"]]


In [29]:
df[pd.notna(df["IMDB_Rating"])]


In [30]:
df[pd.notna(df["Rotten_Tomatoes_Rating"])]


In [31]:
#df.corr()
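
The comparison criteria listed for this section (non-null counts and correlation with the target) can be sketched as follows. The dataframe `toy` and its values are invented for illustration. Restricting the frame to numeric columns before calling `corr()` is a deliberate choice here, since recent pandas versions raise an error when `corr()` is applied to a frame that also contains string columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies data; these values are invented for illustration.
toy = pd.DataFrame({
    "IMDB_Rating": [5.0, 6.0, 7.0, 8.0, np.nan],
    "Rotten_Tomatoes_Rating": [40.0, np.nan, 70.0, np.nan, 90.0],
    "Worldwide_Gross": [10.0, 20.0, 30.0, 40.0, 50.0],
})

candidates = ["IMDB_Rating", "Rotten_Tomatoes_Rating"]
target = "Worldwide_Gross"

# How many usable (non-null) values each candidate feature has.
non_null = toy[candidates].notna().sum()

# Pearson correlation of each candidate with the target.
# DataFrame.corr() skips missing values pair by pair.
corr_with_target = toy[candidates + [target]].corr()[target].drop(target)

print(non_null)
print(corr_with_target)
```

A feature with a slightly lower correlation but many more non-null values may still be the more practical predictor.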

### Explore possible reasons for the higher variance of `Worldwide_Gross`


In [32]:
df[(df["IMDB_Rating"] > 6) & (df["IMDB_Rating"] < 8)]
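
One possible way to look into the variance, sketched here on a made-up dataframe (`toy`, its values, and the bin edges are assumptions chosen for illustration), is to bin `IMDB_Rating` into bands and compare the spread of `Worldwide_Gross` within each band.

```python
import pandas as pd

# Toy stand-in for the movies data; these values are invented for illustration.
toy = pd.DataFrame({
    "IMDB_Rating": [4.5, 5.5, 6.5, 7.0, 7.5, 8.5],
    "Worldwide_Gross": [10.0, 14.0, 20.0, 200.0, 80.0, 300.0],
})

# Right-inclusive bands: (0, 6], (6, 8], (8, 10]. The bin edges are an assumption.
bands = pd.cut(toy["IMDB_Rating"], bins=[0, 6, 8, 10])

# Number of movies and standard deviation of gross within each band.
# A band with a single movie has an undefined (NaN) standard deviation.
spread = toy.groupby(bands, observed=True)["Worldwide_Gross"].agg(["count", "std"])

print(spread)
```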
