# User Study Notebook 
-----------
# EDA

For this activity, you will explore a dataset using pandas. While you explore, a library called __lux__ will be active and will suggest visualizations to you. The library tracks which functions you use and suggests visualizations that plot the data those functions touch. To see the recommendations, simply execute the name of the pandas dataframe or series you would like to visualize, and __lux__ will replace the default output with visualizations.

Our goal is to have you explore the dataset as you normally would in Python using __pandas__, and to see how well __lux__ is able to recommend useful visualizations.

As you execute more pandas functions, __lux__ will be able to recommend more visualizations.


## Evaluation Prototype

### Introduction
For this activity we will be using a dataset about movie sales over different years, along with some information about each movie.

Imagine you are a machine learning engineer trying to predict how a movie performs, in terms of its worldwide gross, from the other attributes available in the dataset. To replicate a real-world analysis task, this dataset has not been thoroughly cleaned.

We will pose some exploratory questions throughout the process to help you understand the movies dataset. Please feel free to answer them with either figures or text.


In [2]:
import lux
lux.logger = True
import pandas as pd
lux.config.default_display = "lux"

# data load and setup
df = pd.read_csv("../data/movies-sample.csv")
df['Release_Date'] = pd.to_datetime(df['Release_Date'], infer_datetime_format=True)
df.history.clear()

In [None]:
print(df.history) # should be empty

In [None]:
print(df.shape)

### 1. Understand the dependent variable: Worldwide_Gross
#### 1.1 Are there any null values in the `Worldwide_Gross` column? Try to clean them before proceeding
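One way to check for missing values before cleaning is to count them in the target column. The sketch below uses a small made-up frame in place of the movies sample (the values are illustrative assumptions, not taken from the dataset); only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies sample; the values are invented for illustration.
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Worldwide_Gross": [1.0e6, np.nan, 2.5e6, np.nan],
    "IMDB_Rating": [6.1, 7.4, np.nan, 5.0],
})

# Count missing values in the target column.
n_missing = int(toy["Worldwide_Gross"].isna().sum())

# Drop only the rows where the target is missing; nulls elsewhere are kept.
cleaned = toy.dropna(subset=["Worldwide_Gross"])

print(n_missing, len(cleaned))
```

Passing `subset` keeps rows that are complete in the target even if other columns have gaps, whereas a bare `dropna()` removes any row with a null anywhere.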
> For now, there are no null values in this column, but ideally there should be some. The user might benefit from our plots for `dropna()` calls.

> Will - there are 5 nulls in `Worldwide_Gross`; we may want more, in which case we should change the data sample above.


In [None]:
# 1.1 code v1
df = df[~df.Worldwide_Gross.isna()]

In [None]:
df

In [None]:
# 1.1 code v2
df.dropna(subset=["Worldwide_Gross"])  # a bare df.dropna() seems to return an empty frame here, since every row appears to have at least one null

#### 1.2 What is the distribution of the `Worldwide_Gross`? 
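Outside the Lux widget, a quick numeric summary can answer the same question. This is a minimal sketch on invented values (an assumption, not the real data); `describe()`, `skew()`, and `np.histogram` are standard pandas and NumPy calls.

```python
import numpy as np
import pandas as pd

# Invented values standing in for the real column.
gross = pd.Series([1.0, 2.0, 2.0, 3.0, 50.0], name="Worldwide_Gross")

summary = gross.describe()   # count, mean, std, min, quartiles, max
skewness = gross.skew()      # a positive value suggests a long right tail

# Equal-width histogram counts over the non-null values.
counts, edges = np.histogram(gross.dropna(), bins=5)

print(summary)
print(skewness)
print(counts)
```

A strongly positive skew, as in this toy series, is a hint that a log transform of the target may be worth considering before modeling.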
> Because the earlier cells operated on `Worldwide_Gross`, the recommendation ranks the distribution of our dependent variable first in the `Distribution` tab. (For now, though, we do not log the column when it is only specified through the `subset` parameter.)


In [None]:
# 1.2 code
df.Worldwide_Gross # this breaks for me?? throws a loc error

In [None]:
# 1.2 code
df # worldwide gross shows up first

### 2. Explore how `MPAA_Rating` predicts `Worldwide_Gross`
After examining the features in the dataset, either simply from their names or from the figures presented in the `Distribution` or `Occurrence` tab, we will look at how each feature in turn predicts `Worldwide_Gross` and decide which features to use when building our machine learning model.

#### 2.1 How many different `MPAA_Rating` values are there? What does the distribution look like?
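A sketch of the textual answer, on invented ratings (an assumption for illustration): `nunique()` gives the number of distinct categories and `value_counts()` their frequencies. Both ignore missing values by default, so `dropna=False` is passed once to make gaps visible.

```python
import pandas as pd

# Invented ratings, including one missing entry.
ratings = pd.Series(["PG", "R", "PG-13", "R", None, "PG", "R"], name="MPAA_Rating")

n_categories = ratings.nunique()               # distinct non-null ratings
counts = ratings.value_counts(dropna=False)    # includes a row for missing values
shares = ratings.value_counts(normalize=True)  # fractions over non-null rows

print(n_categories)
print(counts)
print(shares)
```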
> Here our advantage is presenting the relevant information in a graph instead of plain numbers.


In [None]:
df["MPAA_Rating"].value_counts()

#### 2.2 How does the mean of `Worldwide_Gross` differ across `MPAA_Rating` values?
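For the textual route, selecting the target column before aggregating keeps the result to the one statistic the question asks about. A minimal sketch on invented rows (the values are assumptions):

```python
import pandas as pd

# Invented rows; only the column names come from this notebook.
toy = pd.DataFrame({
    "MPAA_Rating": ["PG", "PG", "R", "R", "R"],
    "Worldwide_Gross": [10.0, 30.0, 5.0, 10.0, 15.0],
})

# Mean gross per rating, largest first.
mean_gross = (
    toy.groupby("MPAA_Rating")["Worldwide_Gross"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_gross)
```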
> The answer is readily available in the `enhance` tab, though some experienced users may instinctively use `df.groupby()` to solve this problem. In the latter case, we could still present satisfactory figures in the `Column Groups` tab.

In [None]:
df.corr() # why would they call corr for this question?

In [None]:
df.groupby("MPAA_Rating").mean()

In [None]:
df.groupby("MPAA_Rating").agg({"Worldwide_Gross": "mean"})  # aggregate the target, not the grouping key

### 3. Feature selection: Decide between `IMDB_Rating` and `Rotten_Tomatoes_Rating`
> By the end of the second section, the user should already be familiar with how to explore a given feature, so it should be safe to give them more freedom to explore other features at this stage.

At first glance, `IMDB_Rating` and `Rotten_Tomatoes_Rating` are similar features, and we may only need one of them as a predictor variable. The question then is which one is more desirable. There are several dimensions worth taking into consideration.
#### 3.1 The number of non-null datapoints available
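The number of usable values per candidate can be read off with `notna().sum()`. The frame below is a made-up stand-in (the values are assumptions); only the column names come from this notebook.

```python
import numpy as np
import pandas as pd

# Invented values with gaps in both candidate columns.
toy = pd.DataFrame({
    "IMDB_Rating": [6.1, 7.4, np.nan, 5.0, 8.2],
    "Rotten_Tomatoes_Rating": [55.0, np.nan, np.nan, 40.0, 91.0],
})
candidates = ["IMDB_Rating", "Rotten_Tomatoes_Rating"]

non_null = toy[candidates].notna().sum()   # usable rows per column
coverage = toy[candidates].notna().mean()  # same, as a fraction of all rows

print(non_null)
print(coverage)
```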

#### 3.2 The general distribution of the feature
> In fact, without writing any more code, users can find relevant figures in the recommendation tabs.

#### 3.3 The correlation (the predictability) between each feature and the predicted variable
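With plain pandas, both correlations can be obtained in one call by taking the target's column of the correlation matrix. A sketch on invented values (assumptions for illustration); pandas computes each pairwise Pearson coefficient over the rows where both columns are present.

```python
import numpy as np
import pandas as pd

# Invented values; only the column names come from this notebook.
toy = pd.DataFrame({
    "IMDB_Rating": [5.0, 6.0, 7.0, 8.0, np.nan],
    "Rotten_Tomatoes_Rating": [40.0, 55.0, np.nan, 90.0, 70.0],
    "Worldwide_Gross": [1.0, 2.0, 3.0, 4.0, 2.5],
})

# Correlation of each candidate feature with the target.
corr_with_target = (
    toy[["IMDB_Rating", "Rotten_Tomatoes_Rating", "Worldwide_Gross"]]
    .corr()["Worldwide_Gross"]
    .drop("Worldwide_Gross")
)
print(corr_with_target)
```

Because missing values are dropped pair by pair, the two coefficients may be based on different subsets of rows, which is worth keeping in mind when comparing them.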
> Are we going to override the function call `df["Worldwide_Gross"].corr(df["IMDB_Rating"])` to make it more convenient for users?

#### 3.4 The relationship between `IMDB_Rating` and `Rotten_Tomatoes_Rating`: do they provide roughly the same information?
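To check this numerically, the two ratings can be correlated directly. A sketch on invented values (assumptions); a coefficient near 1 would suggest that the two columns carry largely overlapping information.

```python
import numpy as np
import pandas as pd

# Invented values with one gap in each column.
toy = pd.DataFrame({
    "IMDB_Rating": [5.0, 6.0, 7.0, 8.0, 9.0, np.nan],
    "Rotten_Tomatoes_Rating": [35.0, 50.0, 72.0, 85.0, np.nan, 60.0],
})

# Keep only the movies that have both ratings, then correlate them.
both = toy.dropna(subset=["IMDB_Rating", "Rotten_Tomatoes_Rating"])
r = both["IMDB_Rating"].corr(both["Rotten_Tomatoes_Rating"])

print(len(both), round(r, 3))
```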
> Here, after exploring these two attributes separately, the correlation graph between them appears first, and we discover that the two features are highly correlated, so it will be more efficient to use just one of them in the model.

In [None]:
newdf = df[pd.notna(df["IMDB_Rating"])]
newdf

In [None]:
newdf = df[pd.notna(df["Rotten_Tomatoes_Rating"])]
newdf

In [None]:
print(df["Rotten_Tomatoes_Rating"].corr(df["Worldwide_Gross"]))
print(df["IMDB_Rating"].corr(df["Worldwide_Gross"]))

### 4. Explore possible reasons for the higher variance of `Worldwide_Gross`
From the correlation between `Worldwide_Gross` and `IMDB_Rating`, we find that, generally speaking, `Worldwide_Gross` rises as `IMDB_Rating` increases. However, for movies whose ratings fall between 6 and 8, the variance of `Worldwide_Gross` is very high. Therefore, we plan to explore this subset of the original dataset and understand the possible reasons.
> Although the real-world dataset is more complicated, what I have in mind for this section is that we modify the dataset so that the higher variance in this group can be explained by some nominal feature, for example, the movie type. In this case, the `enhance` tab could infer from the user's past attention to `Worldwide_Gross` and `IMDB_Rating` and draw a figure that additionally shows this nominal feature on the color channel.
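
The idea can be sketched without Lux: compare the spread of `Worldwide_Gross` across rating bands, then split the 6-8 band by a nominal column. Everything below is invented for illustration; in particular, `Genre` is a hypothetical column name standing in for the "movie type" feature mentioned above, and the band edges are an assumption.

```python
import pandas as pd

# Invented rows; `Genre` is a hypothetical nominal column.
toy = pd.DataFrame({
    "IMDB_Rating": [4.5, 5.0, 6.5, 6.8, 7.2, 7.9, 8.5, 9.0],
    "Worldwide_Gross": [10.0, 12.0, 5.0, 100.0, 8.0, 120.0, 90.0, 95.0],
    "Genre": ["Drama", "Drama", "Drama", "Action",
              "Drama", "Action", "Action", "Drama"],
})

# Bucket ratings into bands; right=False gives [0, 6), [6, 8), [8, 10).
bands = pd.cut(toy["IMDB_Rating"], bins=[0, 6, 8, 10], right=False)
spread_by_band = toy.groupby(bands, observed=True)["Worldwide_Gross"].std()

# Within the middle band, check whether the nominal column separates the values.
mid = toy[(toy["IMDB_Rating"] >= 6) & (toy["IMDB_Rating"] < 8)]
mean_by_genre = mid.groupby("Genre")["Worldwide_Gross"].mean()

print(spread_by_band)
print(mean_by_genre)
```

In this toy data the middle band has by far the largest standard deviation, and the per-genre means inside it are far apart, which is the pattern a color-encoded scatter plot would be expected to reveal.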

In [None]:
newdf = df[(df["IMDB_Rating"] > 6) & (df["IMDB_Rating"] < 8)]
print(newdf["IMDB_Rating"].corr(newdf["Worldwide_Gross"]))

In [None]:
df