# Exploratory Data Analysis & Data Visualization

## Import packages and data

In [None]:
%matplotlib inline

from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
import missingno as msno

import matplotlib.pyplot as plt

**Task 0**: Load the file Universities.csv. 

Notice the path is different, because we're accessing the file in a different folder.

In [None]:
data_df = pd.???("../resource/lib/public/Universities.csv")

## Inspect the data

**Task 1**: Display the FIRST five rows of the dataframe. 

What if we wanted to see the first 10 rows?

In [None]:
# Put your code in this cell

**Task 2**: Display the LAST five rows of the dataframe. 

In [None]:
# Put your code in this cell

**Task 3**: Display a summary of the columns in the dataframe (column name, non-null count, dtype).

In [None]:
# Put your code in this cell

**Task 4**: Display summary statistics for each of the numerical columns in the dataframe.

In [None]:
# Put your code in this cell

If we want to focus on just a few of these values, we can construct a version of this ourselves. (**Hint**: You'll need this for the homework!)

Notice, this code works by creating a dictionary and then converting it to a dataframe.

In [None]:
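# numeric_only=True restricts each statistic to numeric columns
# (newer pandas versions raise an error on text columns otherwise)
pd.DataFrame({'mean': data_df.mean(numeric_only=True),
              'sd': data_df.std(numeric_only=True),
              'min': data_df.min(numeric_only=True),
              'max': data_df.max(numeric_only=True),
              'median': data_df.median(numeric_only=True),
              'length': len(data_df),
              'miss.val': data_df.isnull().sum()
             })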
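To see how the dictionary-to-dataframe trick behaves, here's a minimal sketch on a made-up two-column dataframe (`toy` and its values are invented for illustration, not taken from Universities.csv). Each dictionary key becomes a column, each Series is lined up by the original column names, and a single number such as `len(toy)` is repeated on every row.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, None]})

summary = pd.DataFrame({
    "mean": toy.mean(),               # Series indexed by column name
    "n_rows": len(toy),               # scalar, repeated for every row
    "miss.val": toy.isnull().sum(),   # Series indexed by column name
})
print(summary)
```

The rows of `summary` are the original column names (`a` and `b`), which is why the statistics line up without any extra work.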

For categorical variables, it's important to understand how many samples of each type we have in our dataset. If there's significant imbalance, it will be difficult to predict certain groups.  In this dataset, our only categorical variable is Public_or_Private.

**Task 5**: Display the number of Private and Public Universities.  Use the `value_counts` method.

Note that 2 indicates private and 1 indicates public.

In [None]:
# Put your code in this cell

Would we ever use `value_counts` with continuous variables (floats)? What would happen? You can try this below. There are too many distinct values to compare!

In [None]:
# Put your code in this cell
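Here's a small sketch of why counting works for categories but not for floats. The data is made up, and the `bins` argument is an optional extra that the exercise doesn't require; treat it as one possible way to make counts of continuous values readable.

```python
import pandas as pd

# Toy data (hypothetical values, not from Universities.csv)
toy = pd.DataFrame({
    "group": [1, 2, 2, 1, 2, 2],
    "ratio": [12.3, 15.8, 9.1, 20.4, 11.7, 17.2],
})

# Categorical column: a short, readable table of counts
group_counts = toy["group"].value_counts()
print(group_counts)

# Continuous column: every value is unique here, so each count is 1
ratio_counts = toy["ratio"].value_counts()
print(ratio_counts)

# Passing bins groups the values into equal-width intervals before counting
binned_counts = toy["ratio"].value_counts(bins=3, sort=False)
print(binned_counts)
```

With real data the second table would have nearly one row per university, which is why we usually turn to a histogram for continuous variables instead.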

### Rename variables

Remember, some characters cause problems when used in column names. In the code below, we change those characters into underscores. Usually, we have to do this for empty spaces. In this case, I've noticed hyphens, percentage signs and a forward slash, so I also want to replace those.

**Task 6**: Replace the forward slash `/` with an underscore `_` and rename the percentage signs `%` as 'prct'.

What if we didn't do this? Would dot notation work?

In [None]:
# change hyphens into underscores in variable names
data_df.columns = [s.replace('-', '_') for s in data_df.columns] 

# change forward slash into underscore in variable names
data_df.columns = [s.replace('/', ?) for s in data_df.columns] 

# remove percentage sign in variable names
data_df.columns = [s.replace('%', ?) for s in data_df.columns] 

# removing leading/trailing spaces and change remaining spaces into underscores in variable names
data_df.columns = [s.strip().replace(' ', '_') for s in data_df.columns] 

# Alternately, do this in one line of code
# data_df.columns = [s.strip().replace(' ', '_').replace('-', '_').replace('/', '_').replace('%', 'prct') for s in data_df.columns] 

data_df.columns

<div class="alert alert-info">
Your code before and after this point will refer to some variables differently! Be aware of this when rerunning cells or restarting the kernel.    
</div>
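If you're curious why the renaming matters for dot notation, here's a minimal sketch. The column names in `toy` are invented for illustration and are not the real headers in Universities.csv.

```python
import pandas as pd

# Toy frame with made-up column names containing a hyphen and spaces
toy = pd.DataFrame({"in-state tuition": [7000, 12000],
                    "Graduation rate": [55, 80]})

# Bracket notation works with any column name
print(toy["in-state tuition"])

# Dot notation needs a valid Python identifier, so these would fail:
# toy.in-state tuition   -> SyntaxError
# toy.Graduation rate    -> SyntaxError

# After cleaning the names, both notations work
toy.columns = [s.strip().replace(" ", "_").replace("-", "_") for s in toy.columns]
print(toy.in_state_tuition)
```

Bracket notation always works, so renaming is a convenience rather than a requirement; it just keeps the later cells shorter.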

## Data Visualization

### Scatterplots

Now that we've inspected our data and fixed the variable names, let's move on to visualizing our data. We'll start with scatterplots, which we use to visualize the relationship between two numerical variables. 

We can create scatterplots using either pandas or matplotlib. We'll use pandas first. Breaking down the code below, we have:
1. **data_df**: specify the dataframe name
1. **plot**: use plot function
1. **scatter**: kind of plot
    + x='variable': variable on x-axis (and axis label)
    + y='variable': variable on y-axis  (and axis label)
    + legend: boolean, include legend or not
    + color: of points

Here's the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html) if you want to see more. 

**Task 7**: Fill in the ??s below to compare the in-state tuition amount to the graduation rate.

In [None]:
# the basic pandas scatterplot
data_df.plot.scatter(x=??, y=??, legend=False, color='orange')

We can use Matplotlib to create a slightly more sophisticated scatterplot. In addition to the options above, it allows us to set the size and both the inner and outer color of the point (`facecolor` and `color`, respectively). Here's an example.

In [None]:
# Matplotlib scatterplot

# extracts two elements, fig (the picture itself) and ax (the plot or graph)
fig, ax = plt.subplots() 

# sets the size of the plot to 5 x 3 inches
fig.set_size_inches(5, 3) 

# assigns scatterplot
ax.scatter(data_df.in_state_tuition, data_df.Graduation_rate, color='purple', facecolor='coral') 

# set axis labels, may be different than the variable names
plt.xlabel('Resident Tuition') 
plt.ylabel('Graduation Rate')

# adjusts the spacing so the axis labels fit inside the figure
plt.tight_layout() 

# required to display the plot
plt.show() 

**Task 8**: Recreate the plot above, but change the size to 10 x 6, change the colors to whatever you like, and fill in any remaining ??s.

Here's a [list of colors](https://matplotlib.org/3.1.0/gallery/color/named_colors.html) for your reference. (Scroll down to CSS colors at the bottom of the page.)

In [None]:
# Recreate the plot shown above by reentering some parameters and changing others

# extracts two elements, fig (the picture itself) and ax (the plot or graph)
fig, ax = plt.subplots() 

# sets the size of the plot to 10 x 6 inches
fig.set_size_inches(??, ??) 

# assigns scatterplot
ax.scatter(??.in_state_tuition, ??.Graduation_rate, color=??, facecolor=??) 

# set axis labels, may be different than the variable names
plt.xlabel('Resident Tuition') 
plt.ylabel('Graduation Rate')

# adjusts the spacing so the axis labels fit inside the figure
plt.tight_layout() 

# required to display the plot
plt.show() 

We saw above there seem to be two different groups. Let's see if color can help us figure out what's going on. We'll do this using our lone categorical variable `Public_or_Private`, and we'll color the points differently based on that value.

We do this by using `if` and `else` to specify the color of each point. We'll use one color if `Public_or_Private` equals 2 (private) and another if it equals 1 (public).

**Task 9**: Fill in the ??s below with two different colors.

In [None]:
fig, ax = plt.subplots() 
fig.set_size_inches(10, 6) 

ax.scatter(data_df.in_state_tuition, data_df.Graduation_rate, 
           color=[?? if c == 2 else ?? for c in data_df.Public_or_Private]
          ) 

plt.xlabel('Resident Tuition') 
plt.ylabel('Graduation Rate')
plt.tight_layout() 
plt.show() 
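The `if`/`else` pattern builds one color per row. Here's a small sketch of that idea on made-up codes, alongside a dictionary lookup that does the same job. The color names and the `palette` dictionary are placeholders chosen for illustration; the dictionary version is an alternative, not something this notebook requires.

```python
import pandas as pd

# Hypothetical codes in the same style as Public_or_Private (1 = public, 2 = private)
codes = pd.Series([1, 2, 2, 1])

# The if/else pattern: one color string per row
colors_if_else = ["gray" if c == 2 else "black" for c in codes]

# Equivalent lookup with a dictionary; easier to extend beyond two groups
palette = {1: "black", 2: "gray"}
colors_mapped = codes.map(palette).tolist()

print(colors_if_else)
print(colors_mapped)
```

Either list can be passed as the `color` argument of `ax.scatter`, as long as it has one entry per point.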

Now, let's explore another relationship. We can use our basic pandas scatterplot to do this.

**Task 10:** Display the relationship between the in-state tuition amount (x-axis) and the number of new students (y-axis).

In [None]:
# the basic pandas scatterplot
data_df.plot.scatter(x='??', y='??', legend=False, color='mediumblue')

Notice it looks a little squished? The scales of the variables are very different, which makes it hard to see clear patterns. We can address this by using the log function to transform the variables.

We do this in two steps. 
1. Add two new variables to our dataset, by taking the log of both the x-axis variable and the y-axis variable. 
    + Use the `log` function in the numpy library. 
    + When taking the log, make sure to add 1, in case the original value is equal to 0. (log(0) is undefined.)
2. Use those new variables when generating the scatterplot.

**Task 11**: Fill in the ??s below to finish defining the new variables and plot them.

In [None]:
# create new variables
data_df['New_students_log'] = np.log((data_df.New_students)+1)

data_df['in_state_tuition_log'] = np.??((data_df.??)+??)

# the basic pandas scatterplot
data_df.plot.scatter(x=??, y=??, legend=False, color='mediumblue')
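As a side note on the "add 1" step, here's a minimal sketch on made-up numbers showing what the transform does to a skewed variable. It also uses `np.log1p`, NumPy's built-in version of log(1 + x), purely as a cross-check; the notebook itself sticks with `np.log`.

```python
import numpy as np
import pandas as pd

# Made-up, heavily skewed values (including a zero)
counts = pd.Series([0, 9, 99, 999, 9999])

manual = np.log(counts + 1)   # the pattern used in this notebook
builtin = np.log1p(counts)    # NumPy's built-in log(1 + x)

print(manual.round(2).tolist())
print(np.allclose(manual, builtin))
```

Values that span four orders of magnitude end up on a scale from 0 to roughly 9, which is why the log-transformed scatterplot looks less squished.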

### Bar Charts

Now, let's turn to bar charts, which we use to compare a statistic for a numerical variable across different groups, given by a categorical variable. In this dataset, we have one categorical variable, so that limits our choices!

Breaking this down:
1. **data_df**: dataframe
1. **groupby()**: categorical variable to use to create groups (x-axis)
1. **Graduation_rate**: numerical variable of interest  
1. **mean()**: statistic to calculate for that numerical variable 
1. **plot**: plot function
    + kind: plot type
    + figsize: figure size
    + color: bar color

The code below uses pandas plotting (built on Matplotlib) to compare the average graduation rate for public vs. private schools.

In [None]:
ax = data_df.groupby('Public_or_Private').Graduation_rate.mean().plot(kind='bar', 
                                                                      figsize=[10, 6], 
                                                                      color='orangered')

ax.set_ylabel('Avg. Graduation Rate')

plt.tight_layout()
plt.show()
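It can help to look at the numbers before they become bars. Here's a small sketch on made-up data (the values in `toy` are invented) showing the Series that the `groupby` chain produces; the bar chart draws one bar per row of this Series.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "Public_or_Private": [1, 1, 2, 2, 2],
    "Graduation_rate": [50, 60, 70, 80, 90],
})

# One row per group, holding the statistic for that group
group_means = toy.groupby("Public_or_Private").Graduation_rate.mean()
print(group_means)
```

The index of `group_means` (1 and 2) supplies the x-axis labels, and the values supply the bar heights.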

**Task 12:** Now compare the MEDIAN graduation rate by public vs private school. Copy the code above. What do you need to change?

### Line Graph

To create a line graph, we'll use a different dataset that captures the daily number of bike rentals and other relevant information about the day, like whether it was a working day, the weather, and the type of rentals.

In [None]:
# Import the data
# bike_df = pd.read_csv('../resource/lib/public/bicycle_by_day.csv')
bike_df = pd.read_csv('bicycle_by_day.csv')

In [None]:
# Show the first five rows of data
bike_df.head()

In [None]:
# Show a list of the columns, number of values, and Pandas data types
bike_df.info()

In order to create a line graph, we first need to create a new `Date` column with the proper format. We do this using the pandas `to_datetime` function, as shown below. Let's break down the code we use:
1. **pd**: using the pandas library
2. **to_datetime**: call function to convert argument to pandas datetime object
3. **bike_df.dteday**: dataframe column to be converted
4. **format**: specifies format

No task here, just run the code.

In [None]:
bike_df['Date'] = pd.to_datetime(bike_df.dteday, format='%m/%d/%Y')

To understand what's going on, let's compare our old and new date columns. See the difference?

In [None]:
bike_df[['dteday','Date']].head()
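Here's a minimal sketch of what the conversion buys us, using two made-up date strings in the same month/day/year layout. The strings and variable names are assumptions for illustration only.

```python
import pandas as pd

# Made-up strings in month/day/year order
raw = pd.Series(["1/5/2011", "12/31/2011"])

# %m = month, %d = day, %Y = four-digit year
parsed = pd.to_datetime(raw, format="%m/%d/%Y")

print(raw.dtype, parsed.dtype)

# Datetime values know about calendar parts; plain strings do not
print(parsed.dt.month.tolist())
print(parsed.dt.day_name().tolist())
```

Once a column holds real datetimes, pandas can sort it chronologically and label a time axis properly, which is what the line graph relies on.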

Now, we need to create the time series. We do this by creating a series using the `cnt` column (our numerical variable of interest) with the `Date` column as our index. The `to_numpy` method converts the series to a NumPy array.

**Task 13:** Set the `Date` column as the index.

In [None]:
# Put code here to create the time series for rental
rental_ts = pd.Series(bike_df.cnt.to_numpy(), index=??)

Let's take a look at this time series. (It is a series, not a dataframe.)

In [None]:
print(type(rental_ts)); rental_ts

Finally ready to plot! Below is the code to do that. Let's look at what each part of the code means:

1. **rental_ts**: series
1. **plot**: plot function
    + kind: Not specified! Default is line
    + ylim: range (limits) of the y-axis
    + legend: include legend or not
    + figsize: figure size
    + color: line color

In [None]:
rental_ts.plot(ylim=[0, 10000], 
               legend=False, 
               figsize=[6, 4], 
               color='darkorange')

plt.xlabel('Year')  # set x-axis label
plt.ylabel('Rentals')  # set y-axis label

plt.tight_layout()
plt.show()

### Boxplots

Boxplots are a common type of distribution plot. Below, we show a boxplot that compares the graduation rate for public and private schools.
1. **data_df**: dataframe
1. **boxplot**: call boxplot function
    + column: numerical variable
    + by: categorical variable to group data by

In [None]:
ax = data_df.boxplot(column='Graduation_rate', by='Public_or_Private')

ax.set_ylabel('Graduation Rate')

plt.suptitle('')  # Suppress the overall title
plt.title('')

plt.show()
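If you're wondering what the box and whiskers represent, here's a small sketch with made-up numbers. It assumes Matplotlib's default settings, where the box spans the first to third quartile and points more than 1.5 times the interquartile range (IQR) beyond the box are drawn as outliers. The variable names are just for illustration.

```python
import pandas as pd

# Made-up graduation rates
rates = pd.Series([45, 55, 60, 62, 65, 70, 75, 98])

# Edges of the box and the line inside it
q1, median, q3 = rates.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are plotted individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = rates[(rates < lower_fence) | (rates > upper_fence)]

print(q1, median, q3)
print(lower_fence, upper_fence)
print(outliers.tolist())
```

The whiskers stop at the most extreme data points that still fall inside the fences, so they don't necessarily reach the fence values themselves.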

**Task 14:** What happens if you remove the `by` argument above? Try it in the cell below.

In [None]:
# Put your code in this cell

Next, let's compare a few different variables for public and private schools. Below, I compare book and personal costs, as well as the percentage of students from the top 10 and 25 percent of their high school class.

**Task 15:** Fill in the ??s to specify we want all 4 figures listed in a row.

In [None]:
fig, axes = plt.subplots(nrows=??, ncols=??, figsize = (10, 6))

data_df.boxplot(column='book_costs', by='Public_or_Private', ax=axes[0])

data_df.boxplot(column='personal_costs', by='Public_or_Private', ax=axes[1])

data_df.boxplot(column='Percent_from_top_10prct', by='Public_or_Private', ax=axes[2])

data_df.boxplot(column='Percent_from_top_25prct', by='Public_or_Private', ax=axes[3])

for ax in axes:
    ax.set_xlabel('Public or Private Schools')
    
plt.suptitle('')  # Suppress the overall title
plt.tight_layout()  # Increase the separation between the plots

plt.show()

### Histograms

Here's a simple histogram using the seaborn library (another data visualization library based on matplotlib).

Fun fact: internet sleuths have discovered that this is, in fact, [named for Sam Seaborn](https://stackoverflow.com/questions/41499857/seaborn-why-import-as-sns). These are the things I Google while making these notebooks for you.

Let's break down the code:
1. **sns:** using the seaborn library
2. **histplot:** call histplot function
    + data_df.variable: numerical variable
    + bins: number of bins, if blank auto-selects
    + color: you guessed it, color!

In [None]:
fig, ax = plt.subplots(1,1)

sns.histplot(data_df.Graduation_rate, 
             bins=10,
             color='orange'
            )

ax.set_title('Grad Rate Distribution', fontsize=20)
ax.set(xlabel='Grad Rate', ylabel='count')

We can add in a kernel density estimator, which shows a smoothed estimate of the population density. Note that `kde` is an argument in the `histplot` function that we used above. The default value of `kde` is False.

**Task 16:** Reuse the code above to create a histogram, and add in the kernel density estimator.

In [None]:
# Put your code in this cell

### Special charts

Correlation tables are a (relatively) concise way to display the correlation between different variables in our dataframe. However, they show a lot of information and can be difficult to digest. We solve this by using a heat map.

In [None]:
data_df.corr(numeric_only=True)  # numeric columns only
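To get a feel for how to read a correlation table, here's a minimal sketch on made-up columns whose relationships are known in advance. The dataframe and column names are invented, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Toy data with known relationships (hypothetical values)
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "double_x": [2, 4, 6, 8, 10],   # rises exactly with x
    "minus_x": [5, 4, 3, 2, 1],     # falls exactly as x rises
    "label": list("abcde"),         # text column, left out of the table
})

toy_corr = toy.corr(numeric_only=True)
print(toy_corr)
```

Values near 1 mean two variables rise together, values near -1 mean one falls as the other rises, and the diagonal is always 1 because every variable is perfectly correlated with itself.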

#### Correlation Heatmap

Below, we use the correlation table defined above, but we add a heatmap. Let's break down the code:
1. **sns:** using the seaborn library
2. **heatmap:** call heatmap function
    + ???: first positional argument, the correlation table
    + annot: if True, write data values in each cell
    + fmt: string formatting code for each value; the digit sets the number of decimal places (e.g., `.1f` shows one)
    + cmap: colormap to use; `RdBu_r` is used below, and `RdBu` is another option
    
**Task 17:** Fill in the ?? below and display the correlation heatmap.

In [None]:
corr = data_df.corr(numeric_only=True) # create correlation table (numeric columns only)

fig, ax = plt.subplots()

fig.set_size_inches(11, 7)

sns.heatmap(??, annot=True, fmt=".1f", cmap="RdBu_r", center=0, ax=ax)

plt.show()

#### Missing Value Analysis

Finally, we can generate a bar chart to quickly see which variables are missing values. This code is simple, and I'm sure we're all tired if we have reached the end, so nothing for you to do except run the cell!

In [None]:
msno.bar(data_df, color='deepskyblue')
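If you'd like the same information as numbers rather than bars, here's a small sketch using plain pandas on a made-up dataframe (the values in `toy` are invented). It counts missing and non-missing entries per column, which is roughly what the bar heights summarize.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps (hypothetical values)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7, 8, 9],
})

missing = toy.isnull().sum()    # number of missing values per column
present = toy.notnull().sum()   # number of non-missing values per column

print(missing)
print(present)
```

A column with no gaps has a `present` count equal to the number of rows, so shorter counts point straight at the variables that need attention.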