**YOUR NAME**

Spring 2024

CS 251: Data Analysis and Visualization

# Lab 4a | Pandas and Data Transformations

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

plt.style.use(['seaborn-v0_8-colorblind', 'seaborn-v0_8-darkgrid'])
plt.rcParams.update({'font.size': 10})
plt.rcParams.update({'figure.figsize': [6,6]})

np.set_printoptions(suppress=True, precision=5)

# Automatically reload external modules
%load_ext autoreload
%autoreload 2

## Task 1:  Introduction to Pandas

### 1a. Import and manipulate Bad Drivers dataset with pandas


In this task, we will get familiar with the [pandas](https://pandas.pydata.org) module. The main data type in pandas is called a [DataFrame](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented). Think of `DataFrame` as a direct replacement/substitute for your `Data` class. In fact, starting with Project 4, we will use pandas instead of your `Data` class. As you will soon see, you can easily convert back and forth between pandas DataFrame objects and NumPy ndarrays.
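
As a minimal sketch of that round trip, the example below converts a small DataFrame to an ndarray and back. The names `toy_df`, `toy_arr`, and `round_trip`, and the values themselves, are made up for illustration and are not part of the lab.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real dataset (values are made up for illustration)
toy_df = pd.DataFrame({"height": [1.0, 2.0, 3.0], "weight": [10.0, 20.0, 30.0]})

# DataFrame -> ndarray: rows are samples, columns are variables
toy_arr = toy_df.to_numpy()
print(type(toy_arr), toy_arr.shape)

# ndarray -> DataFrame: the column headers have to be supplied again
round_trip = pd.DataFrame(toy_arr, columns=toy_df.columns)
print(round_trip.equals(toy_df))
```

Note that the ndarray carries no column headers, so they must be passed back in when rebuilding the DataFrame.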

To explore pandas, we will be working with the [Bad Drivers dataset](https://www.kaggle.com/datasets/fivethirtyeight/fivethirtyeight-bad-drivers-dataset). See the description of the Bad Drivers dataset CSV headers below.
1. Import the dataset using pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to create a DataFrame called `df` from the URL:<br/>https://raw.githubusercontent.com/mwaskom/seaborn-data/master/car_crashes.csv
2. Using the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) documentation:
    1. Print the shape
    2. Get a list of the column headers and print it.
    3. In one method call, get the first 5 rows and print the result.
    4. Select the variable `total` and, in one line, print that column of data (*If not all 51 values are displayed under the cell, that is ok*).
3. Print the means of **only** the variables `ins_premium` and `ins_losses`.
4. Create a new DataFrame `df_prnt` which includes only the variables that involve percentages: `speeding`, `alcohol`, `not_distracted`, and `no_previous`.
   1. Update `df_prnt` so that each variable is min-max normalized. This means subtracting the minimum and dividing by the extent so that each variable ranges from 0-1.
   2. Print out the mins and maxs of each variable in `df_prnt`. If everything is working as expected, the mins of each variable should be `0` and the maxs should be `1`.
   3. Print out the means of each column in the `df_prnt` DataFrame.

**Note:** You should not need to use any loops to perform the above steps.
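
The kinds of loop-free operations listed above can be sketched on a toy DataFrame. Everything here (`toy_df`, its columns `a`, `b`, `label`, and the values) is hypothetical and only stands in for the real dataset; treat it as an illustration of the pandas calls, not as the lab solution.

```python
import pandas as pd

# Hypothetical toy data: two numeric variables and one text variable
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 30.0, 20.0],
    "label": ["x", "y", "z"],
})

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column headers as a list
print(toy_df.head(2))        # first 2 rows in one method call
print(toy_df["a"])           # a single column

# Means of only a subset of the variables
col_means = toy_df[["a", "b"]].mean()
print(col_means)

# Min-max normalization without loops: min() and max() are computed
# per column, and the arithmetic aligns on the column labels
sub = toy_df[["a", "b"]]
sub_norm = (sub - sub.min()) / (sub.max() - sub.min())
print(sub_norm.min())
print(sub_norm.max())
```

Selecting with a list of headers (`toy_df[["a", "b"]]`) returns a DataFrame, while a single header (`toy_df["a"]`) returns a Series.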

### Bad Drivers dataset headers

header | description
------- | ------------
'total' | Number of drivers involved in fatal collisions per billion miles
'speeding' | Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding
'alcohol' | Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired
'not_distracted' | Percentage Of Drivers Involved In Fatal Collisions Who Were Not Distracted
'no_previous' | Percentage Of Drivers Involved In Fatal Collisions Who Had Not Been Involved In Any Previous Accidents
'ins_premium' |  Insurance Premiums ($)
'ins_losses' | Losses incurred by insurance companies for collisions per insured driver ($)
'abbrev' | State


In [None]:
url="https://raw.githubusercontent.com/mwaskom/seaborn-data/master/car_crashes.csv"

# 1.

# 2.


In [None]:
# 3.


# 4.



### 1b. Plot insurance premium associated with total number of driver fatalities in each state

1. Using the data in DataFrame `df`, create a scatter plot with the total number of driver fatalities on the x axis and the insurance premium on the y axis.
2. Label the x and y axes appropriately.
3. [Annotate](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.annotate.html) each marker with the state that the sample is associated with.
   1. If you need to, you can use [plt.gca()](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.gca.html) to get the current axis.
   2. In a loop over the number of samples, call [annotate](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.annotate.html) on the axis to label the marker at $(x_i, y_i)$ with the state associated with sample $i$.
   3. Call `plt.show()` like usual after the loop completes.
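
The annotation loop described above can be sketched with made-up points. The arrays `x`, `y`, and the list `labels` are hypothetical placeholders for the columns you would pull from `df`, and the axis label text is illustrative only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample data: three points, each with a text label
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 1.0, 3.0])
labels = ["A", "B", "C"]

plt.scatter(x, y)
plt.xlabel("x variable")
plt.ylabel("y variable")

# Get the current axis, then place one text label at each marker
ax = plt.gca()
for i in range(len(x)):
    ax.annotate(labels[i], (x[i], y[i]))
```

In the notebook, follow the loop with `plt.show()` to display the figure.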

## Task 2. Data Transformations

### 2a. Load and plot Happy dataset

1. Use pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) to create a DataFrame of the Happy dataset located in `data/happy.csv`.
2. Use [to_numpy](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html) to convert the DataFrame to a numpy ndarray. Name the ndarray `happy_xy`.
3. Use the provided matplotlib code in the cell below to show a scatter plot of the Happy dataset. It should look like a tall, thin, vertical line centered at `x=2`, with `y` ranging from ~0 to ~10.

In [None]:
# Your code here


# Keep and run the following to plot happy_xy
plt.plot(happy_xy[:,0], happy_xy[:,1], 'o')
plt.gca().axis('equal')
plt.show()

### 2b. Make the dataset happy

In the cell below:

1. Make a [copy](https://numpy.org/doc/stable/reference/generated/numpy.copy.html) of `happy_xy` and call it `xy`.
2. Apply the data transformations below to make the Happy dataset look like a right-side-up happy face 🙂. When plotted, your final happy face should have the following properties:
   - The face should be centered at the origin. There is no nose in this face, but where the nose *would be* should approximately line up with `(0, 0)`.
   - The extent of the data in `x` and `y` should each be `1`. This means that the final face outline should be a circle (i.e. not stretched out / elongated) and the min/max in both `x` and `y` should be `[-0.5, 0.5]`.
   - The face should look right-side up 🙂.

You should use **each** of the 3 following transformations **once** to get your happy face (*not necessarily in this order!*). **Work with your copy `xy` rather than `happy_xy` so that the original dataset doesn't change!** Otherwise, you will need to load in the data from the CSV file every time before performing your data transformations.

#### Centering

A translation that makes the mean of each data variable `0`:

$$data = data - \vec{\mu}$$

where $data$ is the dataset and $\vec{\mu}$ is a 1D ndarray of length `M` (i.e. number of variables) that contains the mean of each variable in the dataset.
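
A minimal sketch of centering on a made-up array (the name `toy` and its values are hypothetical; rows are samples and columns are variables):

```python
import numpy as np

# Hypothetical dataset: N=3 samples, M=2 variables
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])

# Mean of each variable: collapse the sample axis (axis 0), leaving shape (M,)
mu = toy.mean(axis=0)

# Broadcasting subtracts the matching mean from every row
centered = toy - mu

print(mu)
print(centered.mean(axis=0))
```

After centering, the mean of every column is `0` (up to floating-point rounding).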

#### Min-Max Normalization

A combination of translation and scaling transformations that changes the range of each data variable so that they are all between `0` and `1` (inclusive):

$$data = \frac{data - \vec{mins}}{\vec{maxs} - \vec{mins}}$$

where $data$ is the dataset, $\vec{mins}$ is a 1D ndarray of length `M` that contains the mins of each variable in the dataset, and $\vec{maxs}$ is a 1D ndarray of length `M` that contains the maxs of each variable in the dataset.
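
A minimal sketch of the formula on a made-up array (the name `toy` and its values are hypothetical):

```python
import numpy as np

# Hypothetical dataset: N=3 samples, M=2 variables
toy = np.array([[1.0, 10.0],
                [2.0, 30.0],
                [3.0, 20.0]])

# Per-variable mins and maxs, each of shape (M,)
mins = toy.min(axis=0)
maxs = toy.max(axis=0)

# Translate so each min becomes 0, then scale so each max becomes 1
normalized = (toy - mins) / (maxs - mins)

print(normalized)
print(normalized.min(axis=0), normalized.max(axis=0))
```

Each variable is rescaled independently, so variables with very different ranges end up on the same `0` to `1` scale.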

#### 2D Rotation

A rotation of 2D data about the origin `(0, 0)`:

```
data = (R2 @ data.T).T
```

where $data$ is the dataset and $R2$ is a 2D ndarray (shape: `(2, 2)`):

$$
R2 = \begin{bmatrix}
\cos(\theta) & -\sin(\theta) \\
\sin(\theta) & \cos(\theta)
\end{bmatrix}$$

where $\theta$ is the angle to rotate the dataset by in **radians**.
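
A minimal sketch of the rotation on two made-up points, using an illustrative angle of 90 degrees (the names `pts`, `theta`, and `rotated` are hypothetical):

```python
import numpy as np

# Hypothetical dataset: the unit vectors along x and y, one point per row
pts = np.array([[1.0, 0.0],
                [0.0, 1.0]])

# Convert the angle from degrees to radians before using np.cos / np.sin
theta = np.deg2rad(90)

# 2D rotation matrix, shape (2, 2)
R2 = np.array([[np.cos(theta), -np.sin(theta)],
               [np.sin(theta),  np.cos(theta)]])

# Transpose so points are columns, rotate, then transpose back to (N, 2)
rotated = (R2 @ pts.T).T

print(np.round(rotated, 5))
```

With a positive angle the rotation is counterclockwise: `(1, 0)` moves to approximately `(0, 1)` and `(0, 1)` moves to approximately `(-1, 0)`.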

#### Reminders

- The `np.sin` and `np.cos` functions assume that you are passing in the angle in **radians**, not degrees. So it might be helpful to use [np.deg2rad](https://numpy.org/doc/stable/reference/generated/numpy.deg2rad.html). 
- You should be computing the means, mins, and maxs *for each variable*. Think about which axis (if any) that the computations should be applied over.

In [None]:


# Keep and run the following to plot xy
plt.plot(xy[:,0], xy[:,1], 'o')
plt.gca().axis('equal')
plt.show()

## Turn in your lab

Follow the usual submission format and submit your lab on Google Classroom.