# Basic Statistics and Data Visualization

**This module tests your knowledge of calculating basic statistics and producing basic visualizations.**

To learn about basic statistics and data visualization, we are going to use a modified version of [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset), which contains information about movies.

The dataset is located at `data/movies.csv` and has the following fields:

```
    budget: Movie budget (in $).
    genre: Genre the movie belongs to.
    original_language: Language the movie was originally filmed in.
    production_company: Name of the production company.
    production_country: Country where the movie was produced.
    release_year: Year the movie was released.
    revenue: Movie ticket sales (in $).
    runtime: Movie duration (in minutes).
    title: Movie title.
    vote_average: Average rating in MovieLens.
    vote_count: Number of votes in MovieLens.
```

Importing the necessary packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import math 
import hashlib

Importing data

In [None]:
movies = pd.read_csv("data/movies.csv")

What are the dimensions of our data?

In [None]:
movies.shape

Let's see what our data looks like

In [None]:
movies.head()

## Basic Statistics

---
### Ex1. How many votes does the most voted movie have? What is its `Id`? What is its `Title`? How many votes does the least voted movie have? What is its `Id`? What is its `Title`?

In [None]:
# YOUR CODE HERE

In [None]:
number_votes_most_voted_movie_hash = 7289091718542825017
id_most_voted_movie_hash = 6430904777360137614
title_most_voted_movie_hash = 3137659225704790758
number_votes_least_voted_movie_hash = 4685905418773825377
id_least_voted_movie_hash = -4428088901142818636
title_least_voted_movie_hash = 7976510494291908846

assert number_votes_most_voted_movie_hash == hash(str(number_votes_most_voted_movie))
assert id_most_voted_movie_hash == hash(str(id_most_voted_movie))
assert title_most_voted_movie_hash == hash(str(title_most_voted_movie))
assert number_votes_least_voted_movie_hash == hash(str(number_votes_least_voted_movie))
assert id_least_voted_movie_hash == hash(str(id_least_voted_movie)), "If there are multiple "\
"movies with the minimum amount of votes, return the id of the first occurrence."
assert title_least_voted_movie_hash == hash(str(title_least_voted_movie)), "If there "\
"are multiple movies with the minimum amount of votes, return the title of the first occurrence."

---
### Ex2. Find how many movies share the maximum and minimum number of votes.

Remember that `idxmax` and `idxmin` only return the index of the first occurrence.
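
A minimal sketch of that behaviour on a made-up toy frame (the `toy` name and its values are illustrative and not taken from `movies.csv`): `idxmax` reports a single label even when several rows tie, whereas comparing the column with its maximum counts all of them.

```python
import pandas as pd

# Hypothetical toy data, not the movies dataset
toy = pd.DataFrame({"title": ["a", "b", "c", "d"], "votes": [3, 7, 7, 1]})

# Label of the first row that holds the maximum; later ties are ignored
first_max_label = toy["votes"].idxmax()

# Boolean mask of every row equal to the maximum; summing it counts the ties
n_tied_for_max = (toy["votes"] == toy["votes"].max()).sum()

print(first_max_label, n_tied_for_max)
```

The same comparison against `.min()` would count the rows that share the minimum.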

In [None]:
# YOUR CODE HERE

In [None]:
number_most_voted_movies_hash = 5147904555119197763
number_least_voted_movies_hash = -7126902051461846454

assert number_most_voted_movies_hash == hash(str(number_most_voted_movies))
assert number_least_voted_movies_hash == hash(str(number_least_voted_movies))

---
### Ex3. Analyse each movie's final Vote Average

Find the following information:
- What is the minimum and maximum vote average?
- What is the most common vote average?
- What is the average vote average?
- What is the median vote average?
- What is the standard deviation of the vote average?
- What is the variance of the vote average?
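
As a rough illustration of the pandas methods that cover these quantities, here is a sketch on a hypothetical `ratings` Series (the values are invented and not taken from `movies.csv`). One detail worth knowing: pandas computes `std` and `var` with `ddof=1` (sample statistics) by default, while NumPy's `np.std` and `np.var` default to `ddof=0`.

```python
import pandas as pd

# Hypothetical ratings, not taken from movies.csv
ratings = pd.Series([5.0, 6.0, 6.0, 7.0, 8.0])

summary = {
    "min": ratings.min(),
    "max": ratings.max(),
    "mode": ratings.mode()[0],   # mode() returns a Series; take its first value
    "mean": ratings.mean(),
    "median": ratings.median(),
    "std": ratings.std(),        # sample standard deviation (ddof=1 by default)
    "var": ratings.var(),        # sample variance (ddof=1 by default)
}
print(summary)
```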

In [None]:
# YOUR CODE HERE

In [None]:
maximum_hash = -4180081336580325598
minimum_hash = 4685905418773825377
most_common_hash = -825377147955171141

assert maximum_hash == hash(str(maximum))
assert minimum_hash == hash(str(minimum))
assert most_common_hash == hash(str(most_common))
assert math.isclose(average, 6.18, abs_tol = 0.01)
assert median == 6.3
assert math.isclose(standard_deviation, 1.09, abs_tol = 0.01)
assert math.isclose(variance, 1.21, abs_tol = 0.01)

## Data Visualization

Change the default chart size to 8 inches wide and 8 inches high

In [None]:
inches_wide = 8
inches_high = 8
plt.rcParams["figure.figsize"] = [inches_wide, inches_high]

### Note about the grading

Grading plots is difficult, so we use `plotchecker` to grade the plots with nbgrader.
For `plotchecker` to work with nbgrader, each plotting cell must end with the line

`axis = plt.gca();`

**after the code required to draw the plot**.

<div class="alert alert-danger">
<b>NOTE:</b> If you get an ImportError saying that plotchecker is not defined, make sure you activate the right environment for this unit!
</div>

For example, if we want to plot a scatter plot showing the relationship between budget and vote average we would do as follows:

In [None]:
# code required to plot
movies.plot.scatter(x="budget", y="vote_average")

# last line in the cell, required to "capture" the plot so that nbgrader can grade it
axis = plt.gca();
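
To see what `plt.gca()` captures, the following sketch uses a hypothetical three-row `toy` frame (not the movies data) and checks that the current axes are the ones pandas just drew on, and that the scatter points can be read back from them. The `Agg` backend line is only there so the sketch runs outside a notebook; it is an assumption of this example and not part of the grading setup.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not the movies dataset
toy = pd.DataFrame({"budget": [1.0, 2.0, 3.0], "vote_average": [5.5, 6.0, 7.5]})

ax = toy.plot.scatter(x="budget", y="vote_average")
axis = plt.gca()  # the current Axes, i.e. the one that was just drawn on

# Scatter points are stored in the Axes' collections as an (n, 2) array of x, y pairs
points = axis.collections[0].get_offsets()
print(axis is ax, axis.get_xlabel(), axis.get_ylabel(), points.shape)
```

With the inline notebook backend, figures are typically closed when a cell finishes, which is why the `plt.gca()` call has to sit in the same cell as the plotting code.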

---
### Ex4. How does the budget correlate with the revenue?

In [None]:
# YOUR CODE HERE

axis = plt.gca();

In [None]:
from plotchecker import PlotChecker
def get_data(p, ax=0):
    """
    Parses the plotchecker object to get the relevant data for evaluation.
    """
    all_x_data = []
    lines = p.axis.get_lines()
    collections = p.axis.collections  # use the checker's own axis, not the global `axis`
    if len(lines) > 0:
        all_x_data.append(np.concatenate([x.get_xydata()[:, ax] for x in lines]))
    if len(collections) > 0:
        all_x_data.append(np.concatenate([x.get_offsets()[:, ax] for x in collections]))
    return np.concatenate(all_x_data, axis=0)

pc = PlotChecker(axis)
data = get_data(pc)
assert len(data) == 756, "Did you set the right variables for the plot axes?"
assert set([pc.xlabel] + [pc.ylabel]) == set(["budget", "revenue"]), "Did you set the right variables for the plot axes?"
np.testing.assert_equal(get_data(pc,1), movies.revenue)
print("Success!")

---
### Ex5. How does the average vote count of movies evolve over time? Set the plot title to "Average movie vote count by year"

To calculate the average vote count by year we need to perform an [aggregation](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html). pandas supports this through a technique called [Split-Apply-Combine](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html). This will be explained in the Data Wrangling Specialization.

For now we will do the grouping for you:

In [None]:
avg_vote_count_by_year = movies.groupby("release_year")["vote_count"].mean().reset_index()
avg_vote_count_by_year.columns = ["release_year", "avg_vote_count"]
avg_vote_count_by_year.head()
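
The grouping above follows the split-apply-combine pattern. As a sketch of what happens under the hood, the example below spells out the three steps by hand on a hypothetical `toy` frame (invented values, not the movies data) and compares the result with the one-line `groupby` call.

```python
import pandas as pd

# Hypothetical toy data, not the movies dataset
toy = pd.DataFrame({
    "release_year": [2000, 2000, 2001, 2001, 2001],
    "vote_count": [10, 30, 5, 10, 15],
})

# Split: iterate over one sub-frame per year.
# Apply: take the mean vote count of each sub-frame.
# Combine: collect the per-year means into a single Series.
manual = pd.Series(
    {year: group["vote_count"].mean() for year, group in toy.groupby("release_year")}
)

# The same result in a single call
grouped = toy.groupby("release_year")["vote_count"].mean()

print(grouped)
```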

In [None]:
# YOUR CODE HERE

axis = plt.gca();

In [None]:
pc = PlotChecker(axis)
np.testing.assert_equal(get_data(pc), sorted(movies[movies.runtime.notnull()].release_year.unique()))
np.testing.assert_equal(get_data(pc, ax=1), movies.groupby("release_year")["vote_count"].mean())

assert set([pc.xlabel] + [pc.ylabel]) == set(["release_year", "avg_vote_count"]), "Did you set the right variables for the plot axes?"
pc.assert_title_equal("Average movie vote count by year")
print("Success!")

---
### Ex6. How is the variable `runtime` distributed? Show only movies with a maximum runtime of 3 hours, and change the number of bins to 30. Change the bar color to `red`.

**hint:** [See here how to change plot options](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html)

In [None]:
# YOUR CODE HERE

axis = plt.gca();

In [None]:
pc = PlotChecker(axis)
pc._patches = np.array(pc.axis.patches)
pc._patches = pc._patches[np.argsort([p.get_x() for p in pc._patches])]
pc.widths = np.array([p.get_width() for p in pc._patches])
pc.heights = np.array([p.get_height() for p in pc._patches])

np.testing.assert_allclose(pc.heights, [  5.,   0.,   0.,   0.,   0.,   0.,   0.,   0.,   1.,   2.,   4.,
        15.,  80., 116., 134., 100.,  79.,  67.,  37.,  26.,  15.,   7.,
        11.,   6.,   4.,   0.,   0.,   1.,   0.,   1.])
np.testing.assert_allclose(pc.widths, [6.9 for i in range(len(pc.widths))])
assert pc.xlim[1] == 180, "Did you read the data dictionary?"
assert pc._patches[0].get_facecolor() == (1., 0., 0., 1.), "Did you change the plot color?"
print("Success!")

---
### Ex7. Make a plot that displays the budget broken down by movie language and that allows us to check whether there are outliers.

**hint:** [Check this Visualization](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html)

In [None]:
# Note: matplotlib 3.6 renamed this style to 'seaborn-v0_8'; the old name was removed in 3.8
plt.style.use('seaborn')
# YOUR CODE HERE

axis = plt.gca();

In [None]:
pc = PlotChecker(axis)
pc._lines = pc.axis.get_lines()
pc.colors = np.array([pc._color2rgb(x.get_color()) for x in pc._lines])
np.testing.assert_allclose(pc.colors[0],[0.29803922, 0.44705882, 0.69019608])
np.testing.assert_allclose(pc.yticks,np.array([-2.50,0,2.5,5,7.5,10,12.5,15,17.5,20])*1e7)
assert pc.xticklabels == ['en', 'fr', 'hi', 'it', 'ru'], "Did you select the right categorical variable for the plot?"
print("Success!")

In [None]:
print('Well Done!!')