#### DS133: Business intelligence with advanced spreadsheet

***Student Name***: _____________

# Case 6: NumPy, Pandas and basic visualizations

## NumPy

**Recap: lists**
A list can hold values of any type, including different types at the same time, and you can change, add, and remove elements. One feature is missing, though, and it matters a great deal for aspiring data analysts like yourself: when analyzing data, you often want to carry out operations over entire collections of values, and you want to do this fast. Lists do not support this directly.

**Solution: NumPy**
NumPy is a Python package that provides an alternative to the regular Python list: the NumPy array. A NumPy array is similar to a list but has one additional feature: you can perform calculations over entire arrays at once, which is both simple and fast.
Import the package with `import numpy as np`. The alias `np` is a widely used convention; any valid name would work.

Compare how the same operation behaves on a list and on a NumPy array below.

In [None]:
python_list=[1,2,3]
print(type(python_list))
print(python_list+python_list)

In [None]:
import numpy as np
numpy_array=np.array([1,2,3])
print(type(numpy_array))
print(numpy_array+numpy_array)
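
The same contrast shows up with multiplication. The sketch below uses made-up prices (the names `prices_list` and `prices_array` are illustrative only, not part of the case data) to show that `*` repeats a list but scales every element of an array.

```python
import numpy as np

# Hypothetical prices, chosen only for illustration
prices_list = [10.0, 20.0, 30.0]
prices_array = np.array(prices_list)

# Multiplying a list by an integer repeats the list
repeated = prices_list * 2

# Multiplying an array by a number scales every element
doubled = prices_array * 2

# With a plain list, element-wise arithmetic needs an explicit comprehension
doubled_list = [p * 2 for p in prices_list]

print(repeated)      # [10.0, 20.0, 30.0, 10.0, 20.0, 30.0]
print(doubled)       # [20. 40. 60.]
print(doubled_list)  # [20.0, 40.0, 60.0]
```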

**Practice**

Find the BMI for each person in the given sets of weights and heights. Remember
$$
BMI=\frac{weight}{height^2}
$$

In [None]:
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

# calculate bmi array
np_bmi=...

### Subsetting

Subsetting a NumPy array is similar to subsetting a list: we use `[ ]`. For 2D arrays, the form `[rows, columns]` selects rows and columns, separated by a comma.

![image.png](attachment:ac772468-2f57-4c5b-8c9c-ef4e8ffdb856.png)
![image.png](attachment:b450bd23-8a90-4894-99f7-93e28b9c6ee1.png)

For a 2D array, `array.shape` gives the number of rows and columns as a tuple `(rows, columns)`.
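
As a minimal sketch of these rules, the example below builds a small made-up array (the name `grid` and its values are assumptions for illustration) and selects single elements, whole rows, whole columns, and a block. Remember that indexing starts at 0 and that the end of a slice is excluded.

```python
import numpy as np

# Hypothetical array with 3 rows and 4 columns
grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [9, 10, 11, 12]])

print(grid.shape)      # (3, 4): 3 rows, 4 columns

print(grid[0, 2])      # element in the 1st row, 3rd column: 3
print(grid[2, :])      # entire 3rd row: [ 9 10 11 12]
print(grid[:, 1])      # entire 2nd column: [ 2  6 10]

# Rows 1-2 and columns 2-3 (slice ends are excluded)
block = grid[0:2, 1:3]
print(block)
```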

**Practice**

In [None]:
# 2d array
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79], [65.4, 59.2, 63.6, 88.4, 68.7]])

# find number of rows and columns
print(...)

#select 1st row and assign it to variable row1
row1=...
print('row1= ', row1)

# select 2-4th columns with all rows and assign it to variable cols2_4
cols2_4=...
print('2-4th columns with all rows is', cols2_4)

### Basic statistics

Once we import a library (for example, `numpy`), we can access any variable or function defined within it by typing the name of the module (or its alias, such as `np`), followed by a dot, followed by the name of the variable or function we want.

    <module name>.<name>

Some basic statistical functions:
* `np.mean()` calculates the mean
* `np.median()` calculates the median
* `np.max()` gets the maximum value
* `np.min()` gets the minimum value
* `np.std()` calculates the standard deviation
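
A short sketch of these functions on a made-up 2D array follows (the name `scores` and its values are assumptions for illustration). It also uses the optional `axis` argument, which is not covered above: `axis=0` computes one result per column and `axis=1` one result per row.

```python
import numpy as np

# Hypothetical scores: 2 rows, 3 columns
scores = np.array([[70.0, 80.0, 90.0],
                   [60.0, 65.0, 100.0]])

overall_mean = np.mean(scores)            # mean of all six values
row_means = np.mean(scores, axis=1)       # one mean per row
col_medians = np.median(scores, axis=0)   # one median per column
first_row_max = np.max(scores[0, :])      # largest value in the first row
first_row_std = np.std(scores[0, :])      # population standard deviation (NumPy's default)

print(overall_mean)    # 77.5
print(row_means)       # [80. 75.]
print(col_medians)     # [65.  72.5 95. ]
print(first_row_max)   # 90.0
print(first_row_std)
```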


In [None]:
# calculate the standard deviation of second row
np.std(np_2d[1,:])

In [None]:
# Find mean of the first row
...

In [None]:
# Find the median of columns 3 through 4
...

## Pandas
One shortcoming of `numpy` is that it requires all the data in an array to be of the same type. However, datasets typically contain several data types, so we need a tool that is better suited to the job.
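
To see the limitation in practice, the sketch below (with a made-up array named `mixed`) puts numbers and text into one NumPy array. NumPy silently converts every element to a string so that the array keeps a single type.

```python
import numpy as np

# Mixing an integer, a float, and a string in one array
mixed = np.array([1, 2.5, "three"])

print(mixed)        # every element is now a string
print(mixed.dtype)  # a Unicode string dtype

first_item = mixed[0]
print(repr(first_item))
```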

The `pandas` library (built on the `numpy` package) is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

We will use the `DataFrame`: a 2-dimensional *labeled* data structure with columns of potentially different types. You can think of it as a spreadsheet or SQL table, or as a dict of Series objects. It is generally the most commonly used `pandas` object. There are many ways to create a `DataFrame`. A common way is to import a dataset into a `DataFrame` directly.
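
Besides importing a file, a `DataFrame` can be built from a dictionary of columns. The sketch below uses invented films (the name `films_df` and all values are assumptions for illustration, not taken from the case dataset) to show that each column keeps its own type and that selecting one column returns a Series.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column
films_df = pd.DataFrame({
    "Title": ["Film A", "Film B", "Film C"],
    "Year": [1994, 2008, 2010],
    "Rating": [9.3, 9.0, 8.8],
})

print(films_df)

# Each column has its own data type (text, integer, float)
print(films_df.dtypes)

# Selecting a single column returns a Series
rating_column = films_df["Rating"]
print(type(rating_column))
```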

`pandas` is usually imported as `import pandas as pd`. Below is a good introductory video tutorial on `pandas`. I suggest going over it at your own pace.

In [None]:
from IPython.display import YouTubeVideo
# The original URL is: 
# https://www.youtube.com/watch?v=vmEHCJofslg

YouTubeVideo("vmEHCJofslg", width=600, height=400)

### Application: IMDb ratings
IMDb ratings greater than 8 points (Oscar nominees?), starting from the year 1930, are provided in CSV format.

1. Read the data into a `DataFrame` and name it `imdb_df`

In [28]:
#import pandas once
import pandas as pd

# Read the csv file
imdb_df=pd.read_csv('imdb.csv')

2. Quickly analyze the data

In [None]:
# .head(n) shows first n rows of a DataFrame. if n is not provided then 5 rows
imdb_df.head()

In [None]:
# .tail(n) shows last n rows of a DataFrame. if n is not provided then last 5 rows
imdb_df.tail(10)

In [None]:
# .info() provides info about columns types and missing values if any
imdb_df.info()

In [None]:
# describe() provides summary statistics for the numerical columns
imdb_df.describe()

Now, try the following tasks yourself. The video tutorial above may help.

3. Sort the data with the latest movies first and then by Title in alphabetical order (A..Z).
4. I am interested in watching highly rated movies. Find the movies that have a rating higher than 8.5.
5. On second thought, I want recent highly rated movies with a rating of at least 8.3 (inclusive). Show the movies from 2010 onward (inclusive), ordered from highest Rating and then by alphabetical Title (A..Z).
6. Create a new variable called `mean_dec` that contains the decade averages of [Votes, Rating]. *Hint:* use the `groupby()` method.
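
The methods needed for these tasks can be sketched on an unrelated toy table. Everything below (the name `sales_df`, its columns, and its values) is invented for illustration and is not the IMDb data; the point is only the pattern of `sort_values()`, boolean filtering, and `groupby()`.

```python
import pandas as pd

# Hypothetical store data, unrelated to the case dataset
sales_df = pd.DataFrame({
    "Store": ["B", "A", "C", "A", "B", "C"],
    "Year": [2020, 2021, 2020, 2020, 2021, 2021],
    "Revenue": [120, 150, 90, 100, 130, 95],
    "Score": [4.1, 4.5, 3.9, 4.0, 4.3, 3.8],
})

# Sort by Year (newest first), then by Store (A..Z);
# `ascending` takes one True/False per sort column
sorted_df = sales_df.sort_values(by=["Year", "Store"], ascending=[False, True])

# Filter rows with a boolean condition; combine conditions with &
# and wrap each condition in parentheses
strong = sales_df[(sales_df["Revenue"] >= 100) & (sales_df["Year"] >= 2021)]

# Group rows by Year and average the selected columns within each group
mean_by_year = sales_df.groupby("Year")[["Revenue", "Score"]].mean()

print(sorted_df)
print(strong)
print(mean_by_year)
```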


In [None]:
# sort the data by latest movies first and by alphabetical order
...

In [None]:
# movies that have a rating higher than 8.5
...

In [None]:
# movies from 2010 onward (inclusive) with a rating of at least 8.3, ordered by rating and title
...

In [None]:
mean_dec=...

## Basic visualization

There are many ways to visualize data in Python. Two common libraries are `matplotlib` and `seaborn`.

In [29]:
import matplotlib.pyplot as plt
import seaborn as sns

# (optional) select seaborn style
sns.set_style('darkgrid')

In [None]:
# create a line plot for votes for entire duration
# just run the cell
# change plot size
plt.figure(figsize=(8, 4))
sns.lineplot(data=imdb_df, x='Year',y='Votes')

# Change label of y-axis
plt.ylabel('Votes')


#Show plot
plt.show()

#### Below are some exercises to visualize

1. Create a line plot of *average* Votes for each decade using your variable `mean_dec`.

In [None]:
plt.figure(figsize=(8, 4))
sns.lineplot(...)
plt.ylabel('Average Votes')
plt.show()

2. Create a line plot of Rating for the entire duration.

In [None]:
plt.figure(figsize=(8, 4))
sns.lineplot(...)
plt.ylabel('Rating')
plt.show()

3. Create a bar plot of the *average* Rating for each decade using your variable `mean_dec`.

In [None]:
plt.figure(figsize=(8, 4))
sns.barplot(...)
plt.ylabel('Average Rating')
plt.show()

4. Create a scatter plot of votes against ratings. (This has been done for you)

In [None]:
sns.relplot(data=imdb_df, x='Rating', y='Votes', kind="scatter", hue='Decade', style='Decade')
plt.show()

5. Create a regression line plot of votes against ratings for each decade. (This has been done for you)

In [None]:
sns.lmplot(data=imdb_df, x='Rating', y='Votes', hue='Decade')
plt.show()

6. The regression lines above are too cluttered. From your data set, select only the decades starting from 1980 (inclusive) and create a plot similar to the one above that contains 4 regression lines, one for each of the last 4 decades.

### Submission

After everything is complete, download your notebook (.ipynb) and submit it to Canvas.