# 1. Introduction to Pandas

## Pandas Basics

Pandas is one of the most popular data analysis libraries for Python. It was inspired by features of SQL and R, and it has continued to evolve, adding support for newer computing environments (parallel, in-memory, cloud, ...) as well as advanced analysis capabilities.

The fundamental object we'll be using is the DataFrame. This is basically just a table, but with a lot of built-in, powerful data analysis methods.

## DataFrames and Series

DataFrames are the table-like type used to store data in Pandas. A Series is a single column of data: each column of a DataFrame is a Series. You can also make a Series independently of a DataFrame, for example if you have a list and want to call some analysis methods on it.
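
As a small illustration of calling analysis methods on a standalone Series, here is a minimal sketch. The `steps` list and its values are made up for this example.

```python
import pandas as pd

# A plain Python list of (made-up) daily step counts
steps = [4200, 8100, 6500, 9900]

# Wrap the list in a Series to get pandas' analysis methods
steps_series = pd.Series(steps, name="steps")

print(steps_series.mean())      # average of the values
print(steps_series.max())       # largest value
print(steps_series.describe())  # count, mean, std, min, quartiles, max
```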

In [None]:
import pandas as pd

groceries = {"item": ["bananas", "apples", "oranges"], "quantity": [4, 2, 8]}

groceries_df = pd.DataFrame(groceries)

print("Dict:\n{}\n".format(groceries))
print("DataFrame:\n{}".format(groceries_df))

In [None]:
prices = pd.Series([3.25, 4.50, 1.75])

You can assign new columns to a DataFrame by writing:
`df["new_column"] = some_data`

In [None]:
groceries_df["prices"] = prices

`df.head()` returns the first 5 rows of the DataFrame by default. Pass a number, for example `df.head(10)`, to get a different number of rows.

In [None]:
groceries_df.head()

## Indexing and Selecting Data

In [None]:
# select by column name
print(groceries_df["item"])
# OR
print(groceries_df.item)
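
Rows can be selected as well as columns. The sketch below rebuilds the groceries table so that it runs on its own, then shows selection by position (`iloc`), by condition, and by condition plus column name (`loc`). The names `first_two`, `large_orders` and `large_items` are just illustrative.

```python
import pandas as pd

groceries_df = pd.DataFrame(
    {"item": ["bananas", "apples", "oranges"], "quantity": [4, 2, 8]}
)

# select rows by integer position (the first two rows)
first_two = groceries_df.iloc[0:2]

# select rows with a boolean condition
large_orders = groceries_df[groceries_df["quantity"] > 3]

# select rows and a column together with .loc (condition, column name)
large_items = groceries_df.loc[groceries_df["quantity"] > 3, "item"]
print(large_items.tolist())
```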


# 2. Exploratory Data Analysis (EDA)

As a data scientist, you might know a lot about programming and statistics and have an area of specialty, but you are often asked to use your skills to solve a problem outside of your domain. One of the key skills you need to develop is the ability to explore a dataset so you can get more context about a particular domain. I'm guessing most of us don't know much about flowers or botany, so we're going to see what we can learn from the iris dataset!

## Step 1: Describe the data at a high level

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

In [None]:
# load the dataset (built-in to scikit-learn)
iris = load_iris()

# create a DataFrame of the dataset
ir = pd.DataFrame(iris.data)
# set column names
ir.columns = iris.feature_names
# add species information
ir['species'] = iris.target


# look at the head of the dataset
ir.head()

In [None]:
# fix the column names! no spaces or special characters (drop the "(cm)" unit suffix)
ir.columns = [x.replace(" ", "_").replace("_(cm)", "") for x in ir.columns]
ir.head()

## Encoding/Decoding Data
Sometimes you want to represent a categorical variable with an integer, like if you're building a model. Other times you might want to use a name, like if you're making a plot or analyzing a data frame. Let's convert the species codes to names!

In [None]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

# write a small function to decode the names
def iris_decoder(species_code):
  if species_code == 0:
    return "Setosa"
  elif species_code == 1:
    return "Versicolor"
  else:
    return "Virginica"


In [None]:
# Apply the decoder to every species code and assign the result to a new column
# (the function can be passed directly; wrapping it in a lambda is not needed)
ir['species_name'] = ir['species'].apply(iris_decoder)
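
An alternative to a decoder function is a dictionary lookup with `Series.map`. This is only a sketch: `species_codes` is a hypothetical stand-in for the integer-coded species column, and `code_to_name` repeats the encoding used by the decoder above.

```python
import pandas as pd

# Hypothetical stand-in for the integer-coded species column
species_codes = pd.Series([0, 1, 2, 1, 0])

# Lookup table from code to name (same encoding as the decoder function)
code_to_name = {0: "Setosa", 1: "Versicolor", 2: "Virginica"}

# Series.map replaces each value with its dictionary entry;
# codes missing from the dictionary become NaN
species_names = species_codes.map(code_to_name)
print(species_names.tolist())
```

One difference worth noting: the decoder function returns "Virginica" for any code other than 0 or 1, while the dictionary lookup leaves unknown codes as missing values.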

In [None]:
ir.head()

In [None]:
# get summary statistics for each column in the dataset
# note that there is no missing data!
ir.describe()

In [None]:
# what types are the different variables?
ir.dtypes

What's _object_? Let's look at the first data point and find out. Warning: object columns may hold mixed types!

In [None]:
type(ir.species_name[0])

## Exercise: Checking types
Write some code that prints the type of each item in the *species_name* column. Hint: you can iterate over the items in a Pandas series...

In [None]:
species_name_types = []

for item in ir.species_name:
  species_name_types.append(str(type(item)))
 
print(species_name_types)
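
A more compact way to do the same check is to count how often each type occurs instead of printing one entry per row. The sketch below uses a hypothetical mixed-type column named `mixed` so that it runs on its own.

```python
import pandas as pd

# Hypothetical object column that mixes strings, an integer and a float
mixed = pd.Series(["Setosa", "Versicolor", 3, 2.5], dtype=object)

# Map each item to the name of its type, then count the occurrences
type_counts = mixed.map(lambda item: type(item).__name__).value_counts()
print(type_counts)
```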

## Step 2: Calculate some summary statistics and look at groups
### Group By

Group by will help you answer the vast majority of simple data analysis questions. The basic idea is that you group your data by the values of a variable or set of variables, then calculate a statistic of interest like the mean or minimum.
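
The split-then-summarize idea can be sketched on a tiny made-up table. The `sales` data and its column names are assumptions for illustration only.

```python
import pandas as pd

# Made-up revenue figures for two stores
sales = pd.DataFrame({
    "store": ["north", "north", "south", "south", "south"],
    "revenue": [100, 140, 80, 60, 100],
})

# group the rows by store, then compute several statistics per group
summary = sales.groupby("store")["revenue"].agg(["mean", "min", "max"])
print(summary)
```

Each distinct value of the grouping column becomes one row of the result, and each requested statistic becomes one column.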

In [None]:
# mean of each feature for each group
ir.groupby("species_name").mean()

In [None]:
# max of each feature for each group
ir.groupby("species_name").max()

In [None]:
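# how correlated are our variables?
# (keep only the numeric columns: recent pandas versions raise an error on string columns)
ir.select_dtypes("number").corr()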

## Data Visualization with Seaborn

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
#%matplotlib
# Initialize Figure and Axes object
fig, ax = plt.subplots(figsize=(10,4))
sns.set_context("notebook")

In [None]:
# drop the species code and the species name so only the measurements remain
corr = ir.drop(columns=["species", "species_name"]).corr()

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with square cells and a colormap centered at 0
sns.heatmap(corr, cmap=cmap, vmax=1.0, center=0.0,
            square=True, linewidths=.1,
            cbar_kws={"shrink": .8})

In [None]:
g = sns.PairGrid(ir, hue="species_name")
g = g.map_diag(plt.hist, histtype="step", linewidth=3)
g = g.map_offdiag(plt.scatter)
g = g.add_legend()

In [None]:
sns.boxplot(data=ir, x="species_name", y="sepal_length")

In [None]:
sns.boxplot(data=ir, x="species_name", y="sepal_width")

In [None]:
sns.scatterplot(x="sepal_length", y="sepal_width", hue="species_name", data=ir)

In [None]:
sns.scatterplot(x="petal_length", y="petal_width", hue="species_name", data=ir)

So now that we have an idea of what the data looks like, let's try to build a model! The most important part of being a professional data scientist is to make sure your model is solving the right problem. Here we can imagine someone discovering a new flower and not knowing what species it is. We can build a model that can predict the species given the measurements of the flower!

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
x_vars = ir[["sepal_length", "sepal_width", "petal_length", "petal_width"]]
target = ir["species"]

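# the multi_class argument is deprecated in recent scikit-learn versions;
# with the lbfgs solver a multinomial model is fit for multiclass targets by default
clf = LogisticRegression(random_state=0, solver='lbfgs', max_iter=500).fit(x_vars, target)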

# how did we do?
print(clf.score(x_vars, target))

In [None]:
# K-means, #KNN

In [None]:
# what mistake did we make? we scored the model on the same data we trained on,
# so overfitting would go unnoticed!
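
One way to avoid that mistake is to hold back part of the data and score the model only on flowers it has not seen. The following is a minimal sketch, not a tuned model; the split size, the `stratify` choice and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# hold back a third of the flowers; stratify keeps the species balanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=500).fit(X_train, y_train)

# accuracy on data the model was fit on vs. data it has never seen
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print("train accuracy: {:.3f}".format(train_acc))
print("test accuracy:  {:.3f}".format(test_acc))
```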


## Boston house prices dataset - Regression

Regression models involve making a prediction for a continuous (or almost continuous) variable. Things like temperature, price, number of people watching the Super Bowl, etc... Let's look at the Boston house prices dataset to see if we can build a model to predict the price of a house, which could be useful to real estate agents, urban planners, economists, etc...

In [None]:
# NOTE: load_boston was removed in scikit-learn 1.2, so this import needs scikit-learn < 1.2
from sklearn.datasets import load_boston

In [None]:
boston = load_boston()
print(boston.DESCR)

In [None]:
# create DataFrame and add target column
boston_df = pd.DataFrame(boston.data)
# set column names (do this before adding on the target)
boston_df.columns = boston.feature_names
# add target
boston_df["MEDV"] = boston.target
boston_df.head()

## Quick Check
* What types are the variables?
* Do we have any missing data?

In [None]:
boston_df.dtypes

In [None]:
boston_df.describe()

## EDA: Regression

We don't have any predefined groups in this data. We could create some, for example with clustering, but for now let's focus on looking for correlations so we can build a good regression model.

In [None]:
boston_corr = boston_df.corr()

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with square cells and a colormap centered at 0
sns.heatmap(boston_corr, cmap=cmap, vmax=1.0, center=0.0,
            square=True, linewidths=.1, cbar_kws={"shrink": .8})

In [None]:
boston_corr.sort_values(by=["TAX"], ascending=False)

In [None]:
#g = sns.PairGrid(boston_df)
#g = g.map_diag(plt.hist, histtype="step", linewidth=3)
#g = g.map_offdiag(plt.scatter)

# Machine Learning Time!

We will evaluate the regression models using two metrics: mean squared error, MSE = AVERAGE((Prediction - True)^2), and R^2, the coefficient of determination (the fraction of the target's variance that the model explains).
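
To make the two metrics concrete, here is a small sketch that computes them by hand with NumPy. The numbers in `y_true` and `y_pred` are made up; in the cells below, `mean_squared_error` and `r2_score` from scikit-learn do this work.

```python
import numpy as np

# made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 10.0])

# mean squared error: average of the squared differences
mse = np.mean((y_pred - y_true) ** 2)

# R^2: 1 minus (squared error of the model / squared error of predicting the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("MSE: {:.2f}".format(mse))  # 0.75
print("R^2: {:.2f}".format(r2))   # 0.85
```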

## Train/Test Split to avoid overfitting

The biggest difference between descriptive statistics and predictive modeling is that the latter seeks to find a generalizable model that will be good at predicting unseen examples. So our goal isn't just to describe the data, it's to find a pattern that works on new/unseen examples.

Overfitting is when your model finds patterns that are specific to your training data and fail to generalize to new examples. For instance, if I asked everyone in the room their favorite pizza topping, I could build a model that maps each name to a pizza topping. If your name is Sam and your favorite topping is pepperoni, I could build a model that says:

`if name == "Sam":
  return "Pepperoni"`

But this wouldn't be a very good model.

In order to combat overfitting, when we train a model we want to hold back some of our data for testing. This is called a train/test split. If our model performs well on the test data, then we can feel confident we didn't overfit.

## IMPORTANT

When you are training a model on time series data, it is VERY important not to use dates from the future in your training set. For example, if your dataset has data from 2010-2019, you would want to train on 2010-2017 and test on 2018-2019. There's no perfect rule for picking a date to split on, but whatever you do, don't randomly sample the whole dataset!
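
A date-based split might look like the sketch below. The yearly `ts` table, its column names and the 2018 cutoff are made up to mirror the 2010-2019 example.

```python
import pandas as pd

# made-up yearly time series covering 2010-2019
ts = pd.DataFrame({
    "date": pd.to_datetime(["{}-01-01".format(year) for year in range(2010, 2020)]),
    "value": list(range(10)),
})

# everything before the cutoff is training data, everything after is test data
cutoff = pd.Timestamp("2018-01-01")
train = ts[ts["date"] < cutoff]
test = ts[ts["date"] >= cutoff]

print(len(train), len(test))  # 8 2
```

Because the split is defined by a date rather than by random sampling, every training row comes before every test row.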

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

bos_x_train, bos_x_test, bos_y_train, bos_y_test = train_test_split(
  boston_df.drop("MEDV", axis=1, inplace=False),
  boston_df["MEDV"],
  test_size=0.33,
  random_state=42)

## Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Create linear regression object
regr = LinearRegression()

# Train the model using the training sets
regr.fit(bos_x_train, bos_y_train)

# Make predictions using the testing set
bos_y_pred = regr.predict(bos_x_test)

# The coefficients
print("COEFFICIENTS:")
for coef in zip(bos_x_train.columns, regr.coef_):
    print(coef[0], "{:.3f}".format(coef[1]))
# The mean squared error
print("Mean squared error: {:.2f}".format(mean_squared_error(bos_y_test, bos_y_pred)))
# R^2 score: 1 is perfect prediction
print('R^2 score: {:.2f}'.format(r2_score(bos_y_test, bos_y_pred)))


### Evaluation

How do we know if this is a good mean squared error? Let's compare to a simple benchmark: the average of the training data prices:

In [None]:
bos_y_train_mean = bos_y_train.mean()
bos_mean_bench = pd.Series([bos_y_train_mean]).repeat(len(bos_y_test))

# The mean squared error
print("Mean squared error: {:.2f}".format(mean_squared_error(bos_y_test, bos_mean_bench)))
# R^2 score: 1 is perfect prediction
print('R^2 score: {:.2f}'.format(r2_score(bos_y_test, bos_mean_bench)))

Cool, so we beat the simplest possible model.
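
It helps to know what a mean-only benchmark scores in general. Predicting the training mean for every test example gives an R^2 of zero or below on the test set, because R^2 compares the model against predicting the test mean. The sketch below checks this on made-up numbers.

```python
import numpy as np

# made-up training and test targets
y_train = np.array([1.0, 2.0, 3.0, 4.0])
y_test = np.array([2.0, 3.0, 4.0])

# the benchmark predicts the training mean for every test example
baseline_pred = np.full(len(y_test), y_train.mean())

mse = np.mean((y_test - baseline_pred) ** 2)
ss_res = np.sum((y_test - baseline_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("baseline MSE: {:.3f}".format(mse))
print("baseline R^2: {:.3f}".format(r2))
```

So any model with a clearly positive R^2 on the test set is doing better than this benchmark.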

Let's compare to a more sophisticated machine learning model:

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_regr = RandomForestRegressor(
  # We are minimizing MSE (this criterion was named 'mse' before scikit-learn 1.0)
  criterion='squared_error',
  # Bootstrap
  bootstrap=True,
  # How deep is each tree in the forest?
  max_depth=4,
  # How many trees are in the forest?
  n_estimators=100,
  # Set a random seed so we can reproduce the result
  random_state=0,
  # Do we want to print information to the console?
  verbose=0  # set to 2 to print progress
)

rf_regr.fit(bos_x_train, bos_y_train)  




In [None]:
# Make predictions using the testing set
bos_y_pred_rf = rf_regr.predict(bos_x_test)

# The feature importances (a random forest has no coefficients)
print("Feature Importances:")
for coef in zip(bos_x_train.columns, rf_regr.feature_importances_):
    print(coef[0], "{:.3f}".format(coef[1]))

In [None]:
# The mean squared error
print("Mean squared error: {:.2f}".format(mean_squared_error(bos_y_test, bos_y_pred_rf)))
# R^2 score: 1 is perfect prediction
print('R^2 score: {:.2f}'.format(r2_score(bos_y_test, bos_y_pred_rf)))

In [None]:
# Let's compare the models!

## Model Training Pipeline

# Data Wrangling Skills

## How to read data from a file

In [None]:
# this works for a small file :)
with open("example_dataset.csv") as f:
  for line in f:
    # each line already ends with a newline, so don't let print add another
    print(line, end="")

In [None]:
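# this works for a big file (read only the first N lines)
from itertools import islice

N = 3
with open("example_dataset.csv") as f:
    # islice stops early instead of raising if the file has fewer than N lines
    head = list(islice(f, N))

for line in head:
  print(line, end="")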

In [None]:
# We use read_csv for csv files. There is also read_excel for Excel files.
example_df = pd.read_csv("example_dataset.csv", sep=",")
example_df.head()
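
`read_csv` can also preview a large file without loading all of it, using its `nrows` parameter. In this sketch the CSV content is a made-up in-memory string (read through `io.StringIO`) so that it runs without a file on disk; the column names are assumptions.

```python
import io

import pandas as pd

# made-up CSV content standing in for a file on disk
csv_text = "team,player,hits\nA,Ann,10\nA,Bob,\nB,Cy,4\n"

# nrows limits how many data rows are parsed
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview)
```

The empty field in Bob's row is read as a missing value (NaN).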

## How to handle missing data

First, see how much data you are missing and where it's missing from!

Then, you can do any/all/none of the following:
1. Drop the missing data
2. Impute the missing data
3. Predict the missing data
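
The first step, seeing how much data is missing and where, can be sketched as follows. The small `df` table and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# made-up table with a few missing values
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "hits": [10, np.nan, 4, np.nan],
    "walks": [1, 2, np.nan, 3],
})

# how many values are missing in each column?
missing_counts = df.isna().sum()

# what share of each column is missing?
missing_share = df.isna().mean()

# which rows have at least one missing value?
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_counts)
print(missing_share)
print(rows_with_missing)
```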

In [None]:
example_df.dropna()

In [None]:
# Compute the mean hits for each team (we will use it to impute missing values)
team_mean_hits = (example_df[["team", "hits"]]
                  .groupby("team")
                  .mean()
                  .reset_index())

team_mean_hits.head()

In [None]:
# select the missing rows
example_df.loc[pd.isna(example_df.hits)]
#example_df.hits.loc[pd.isna(example_df.hits)]

In [None]:
# merge the team mean hits (outer = keep all rows)
baseball_merged = example_df.merge(team_mean_hits, on="team", how="outer", suffixes=("", "_mean"))

In [None]:
baseball_merged["hits_imp"] = baseball_merged.hits.combine_first(baseball_merged.hits_mean)
baseball_merged.head()
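
The same group-mean imputation can be written without a merge by using `groupby(...).transform("mean")`, which returns one value per original row. This is an alternative sketch on a made-up table, not the approach used in the cells above.

```python
import numpy as np
import pandas as pd

# made-up table with missing hits
df = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "hits": [10.0, np.nan, 20.0, 4.0, np.nan],
})

# remember where values were missing before filling them in
df["hits_missing"] = df["hits"].isna()

# mean hits of each row's team, aligned with the original rows
team_mean = df.groupby("team")["hits"].transform("mean")

# fill only the missing entries with the team mean
df["hits_imp"] = df["hits"].fillna(team_mean)
print(df)
```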

In [None]:
# remember where we imputed (this only works BEFORE you impute the missing data!)
example_df["hits_missing"] = pd.isna(example_df.hits)