# User Study Notebook 
-----------
# Part 1: EDA

For this activity, you will be asked to explore a dataset using pandas. While you explore, a library called __lux__ will be active and will suggest visualizations to you. The library tracks which functions you use and suggests visualizations that plot the data from those functions. To see the recommendations, simply execute the name of the pandas dataframe or series you would like to visualize, and __lux__ will replace the default output with visualizations.

Our goal is to have you explore the dataset as you normally would in Python using __pandas__, and to see how well __lux__ is able to recommend useful visualizations.

As you execute more pandas functions, __lux__ will be able to recommend more visualizations.


In [None]:
import lux
# lux.logger = False
from vega_datasets import data
import pandas as pd

# lux.config.default_display = "lux"

For this activity, we will be using a dataset about movie sales across different years, along with some information about each movie. To replicate a real-world analysis task, this dataset has not been thoroughly cleaned.

In [None]:
df = pd.read_csv("../data/movies.csv")
df['Release_Date'] = pd.to_datetime(df['Release_Date'], infer_datetime_format=True)
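Because the dataset has not been thoroughly cleaned, some date strings may fail to parse. The following is a minimal sketch, not part of the study setup, of how `errors="coerce"` in `pd.to_datetime` turns unparseable entries into `NaT` so they can be counted. The `raw_dates` values are made up for illustration and are not taken from the movies data.

```python
import pandas as pd

# Toy stand-in for a date column; these values are invented for illustration.
raw_dates = pd.Series(["2001-06-12", "1998-08-28", "TBD", None])

# errors="coerce" converts entries that cannot be parsed into NaT
# instead of raising an exception.
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Count entries that are missing or could not be parsed.
n_unparsed = int(parsed.isna().sum())

print(parsed)
print("unparsed entries:", n_unparsed)
```

Checking this count before and after parsing is one way to tell whether a conversion silently lost information.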

In [None]:
df.to_csv("../data/movies.csv", index=False)

Since we modified the `Release_Date` column above, you can see that it is included in our recommendations below.

In [None]:
df

In [None]:
df

Please take a couple of minutes to familiarize yourself with this dataset. You will be asked some specific analysis questions later on, but for now take some time to explore the features of the dataset, clean or reformat it however you like, and become generally comfortable with what is in the data.

In [None]:
df.describe()

In [None]:
# code to fill or get rid of nas, etc
# some value counts maybe
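As a starting point for the cleaning step, here is a minimal sketch of common pandas calls for inspecting and handling missing values. It runs on a small made-up frame (`toy`) that only borrows two column names from the movies dataset. Which strategy is appropriate for the real data is a judgment call left to you.

```python
import pandas as pd

# Toy rows borrowing column names from the movies dataset; values are made up.
toy = pd.DataFrame({
    "MPAA_Rating": ["R", "PG", None, "R"],
    "Worldwide_Gross": [100.0, None, 250.0, 400.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()

# Option 1: drop rows that lack a gross value.
dropped = toy.dropna(subset=["Worldwide_Gross"])

# Option 2: keep every row and fill the gap with the column median.
filled = toy.fillna({"Worldwide_Gross": toy["Worldwide_Gross"].median()})

# Count ratings, treating missing ratings as their own category.
rating_counts = toy["MPAA_Rating"].value_counts(dropna=False)

print(missing_per_column)
print(rating_counts)
```

Dropping rows keeps only observed values, while filling preserves the row count at the cost of inventing a value, so the choice can change later aggregates.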

-----------
# Part 2: Specific Tasks

Great! Hopefully you feel comfortable with the general features of this dataset. You will now be asked a couple of specific questions about the data (that you may or may not have explored during your EDA). 

### 2.1
How many distinct `MPAA_Rating` values are there? What does their distribution look like?

In [None]:
df.MPAA_Rating.value_counts()

### 2.2
What does the distribution of `Worldwide_Gross` look like?

In [None]:
df.Worldwide_Gross

### 2.3
What are the average worldwide and US gross for movies in each `MPAA_Rating` category?

In [None]:
df.groupby("MPAA_Rating").agg({"US_Gross": "mean", "Worldwide_Gross": "mean"})

### 2.4
What does the distribution of median worldwide gross look like across genres? Which genres have the highest and lowest median worldwide gross?

In [None]:
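# numeric_only=True skips text columns, which newer pandas versions refuse to aggregate
df.groupby("Major_Genre").median(numeric_only=True)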
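To read off the highest and lowest genres directly, one option is to select the single column, sort the medians, and use `idxmin` and `idxmax`. The sketch below uses a made-up frame (`toy`) with invented numbers, so the genres it reports say nothing about the real dataset.

```python
import pandas as pd

# Toy data with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "Major_Genre": ["Action", "Action", "Drama", "Drama", "Comedy"],
    "Worldwide_Gross": [300.0, 500.0, 50.0, 150.0, 200.0],
})

# Median gross per genre, sorted from lowest to highest.
medians = toy.groupby("Major_Genre")["Worldwide_Gross"].median().sort_values()

# Index labels of the smallest and largest medians.
lowest_genre = medians.idxmin()
highest_genre = medians.idxmax()

print(medians)
print("lowest:", lowest_genre, "highest:", highest_genre)
```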

### 2.5
How do highly rated movies (`IMDB_Rating` > 5) compare to the rest? This question is open-ended, so answer however you please.

In [None]:
df[df.IMDB_Rating > 5]
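A filter shows only one side of the comparison. One possible way to see both sides at once is to label each row by whether it clears the threshold and then aggregate per label. This is a sketch on made-up data, and the `rating_group` column and its labels are hypothetical names introduced here for illustration.

```python
import pandas as pd

# Toy data with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "IMDB_Rating": [7.5, 4.0, 6.1, 3.2],
    "Worldwide_Gross": [400.0, 100.0, 300.0, 50.0],
})

# Hypothetical helper column: label each movie by its side of the threshold.
labels = toy["IMDB_Rating"].gt(5).map({True: "above 5", False: "5 or below"})
labelled = toy.assign(rating_group=labels)

# Summarize gross for both groups side by side.
comparison = labelled.groupby("rating_group")["Worldwide_Gross"].agg(["count", "median"])

print(comparison)
```

The same pattern extends to other metrics by swapping the aggregated column.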

### 2.6
How do high `IMDB_Rating` and `Rotten_Tomatoes_Rating` scores relate to the other metrics?

In [None]:
df[(df.IMDB_Rating > df.IMDB_Rating.mean()) & (df.Rotten_Tomatoes_Rating > df.Rotten_Tomatoes_Rating.mean())]

------------
# Part 3: Encoding feedback (optional)

Please run the following cells and give feedback on how appropriate the recommended visual encoding is for each function.

In [None]:
df = data.movies()
df['Release_Date'] = pd.to_datetime(df['Release_Date'], infer_datetime_format=True)
df.history.clear()

In [None]:
df

### 3.1: `describe()`

In [None]:
df.describe()

### 3.2: `value_counts()`

In [None]:
df.Distributor.value_counts()

In [None]:
df.MPAA_Rating.value_counts()

### 3.3: Filter

In [None]:
df[df.MPAA_Rating == "R"]

In [None]:
df[(df.MPAA_Rating == "R") & (df.Worldwide_Gross > 85343400)]

### 3.4: df agg

In [None]:
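# numeric_only=True skips text columns, which newer pandas versions refuse to aggregate
df.mean(numeric_only=True)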

In [None]:
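# numeric_only=True skips text columns, which newer pandas versions refuse to aggregate
df.median(numeric_only=True)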

### 3.5: df groupby agg

In [None]:
# Select the column before aggregating; mean() does not accept a column name
df.groupby("MPAA_Rating")["Worldwide_Gross"].mean()

In [None]:
df.groupby("Distributor").agg({"Worldwide_Gross": "mean", "US_Gross": "mean", "Runtime_min": "median"})