# Introduction to Data Science

## Pandas Basics

Pandas is the most popular data analysis library for Python. It was inspired by features of SQL and R, and it continues to add support for newer hardware and deployment environments (parallel, in-memory, cloud, ...) as well as advanced analysis capabilities.

The fundamental object we'll be using is the DataFrame. A DataFrame is essentially a table with many powerful data analysis methods built in.

In [None]:
import pandas as pd

## DataFrames and Series

DataFrames are the table-like type used to store data in Pandas. A Series is a single column of data: each column of a DataFrame is a Series. You can also create a Series independently of a DataFrame, for example if you have a list and want to call some analysis methods on it.

In [None]:
groceries = {"item": ["bananas", "apples", "oranges"], "quantity": [4, 2, 8]}

groceries_df = pd.DataFrame(groceries)

print("Dict:\n{}\n".format(groceries))
print("DataFrame:\n{}".format(groceries_df))

In [None]:
prices = pd.Series([3.25, 4.50, 1.75])
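
As a minimal sketch of calling analysis methods on a standalone Series, the example below builds one from a plain list. The variable names are illustrative and not part of the notebook's dataset.

```python
import pandas as pd

# hypothetical example data: a plain Python list of prices
price_list = [3.25, 4.50, 1.75]
price_series = pd.Series(price_list, name="price")

print(price_series.mean())    # average price
print(price_series.max())     # highest price
print(price_series.idxmax())  # index label of the highest price
```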

You can assign new columns to a DataFrame by writing:
`df["new_column"] = some_data`

In [None]:
groceries_df["price"] = prices

`df.head()` returns the first 5 rows of the DataFrame and `df.tail()` returns the last 5. You can also pass a number of rows, like `df.head(10)`, to display a custom number of rows.
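
A quick way to check the default behavior is to count the rows that come back. This sketch uses a made-up 12-row DataFrame (`numbers_df` is a hypothetical name) so the difference between the default and an explicit row count is visible.

```python
import pandas as pd

# hypothetical 12-row DataFrame, used only to count rows
numbers_df = pd.DataFrame({"n": range(12)})

default_head = numbers_df.head()    # no argument: default row count
default_tail = numbers_df.tail()    # no argument: default row count
custom_head = numbers_df.head(10)   # explicit row count

print(len(default_head), len(default_tail), len(custom_head))
```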

In [None]:
groceries_df.head()

In [None]:
# add a subtotal column
groceries_df["subtotal"] = groceries_df.quantity * groceries_df.price

In [None]:
groceries_df.head()

## Selecting Data

In [None]:
# select a column by name; attribute access (groceries_df.item) also works
groceries_df["item"]

In [None]:
# df.loc is used for label-based indexing
groceries_df.loc[1,"price"]

In [None]:
# df.iloc is used for integer-based indexing
groceries_df.iloc[1, 2]

In [None]:
# you can select several columns at once by passing a list of column names
groceries_df[["item", "price"]]

In [None]:
# you can select rows the same way using loc or iloc
groceries_df.loc[[0, 1]]
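
With the default integer index, `loc` and `iloc` can look interchangeable. The sketch below uses a made-up DataFrame with string index labels (`stock` is a hypothetical name) to show where the two differ.

```python
import pandas as pd

# hypothetical data with item names as the index labels
stock = pd.DataFrame(
    {"quantity": [4, 2, 8], "price": [3.25, 4.50, 1.75]},
    index=["bananas", "apples", "oranges"],
)

by_label = stock.loc["apples", "price"]      # row label, column label
by_position = stock.iloc[1, 1]               # second row, second column
label_slice = stock.loc["bananas":"apples"]  # label slices include the end label
position_slice = stock.iloc[0:1]             # integer slices exclude the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```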

## Reshaping Data

Datasets are not always organized the way we want them to be - sometimes we need each row to have a single data point, other times we might want each row to contain multiple data points. This might be for making a plot or producing statistics or a model.

Pandas uses the following concepts to describe dataset layout:
- Index: columns/id to identify a row
- Columns: named columns per row
- Values: measurements/data we want to use

Usually, datasets come in one of two layouts:
- Long: one measurement per row, includes measurement description as a column
- Wide: many measurements per row

Pandas indexing can make reshaping complicated. Read more in the reshaping guide: https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-pivot
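
To make the index/columns/values vocabulary concrete, here is a small sketch that moves made-up temperature readings from long to wide and back. The data and names (`long_df`, `temp_c`) are invented for illustration.

```python
import pandas as pd

# hypothetical long-format data: one measurement per row
long_df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Lima", "Lima"],
    "month": ["Jan", "Jul", "Jan", "Jul"],
    "temp_c": [-3.0, 17.0, 23.0, 17.5],
})

# wide: one row per city (index), one column per month (columns),
# with temperatures as the cell contents (values)
wide_df = long_df.pivot(index="city", columns="month", values="temp_c")

# back to long: melt the month columns into a single "month" column
long_again = wide_df.reset_index().melt(
    id_vars="city", var_name="month", value_name="temp_c"
)

print(wide_df)
print(long_again)
```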

In [None]:
# melting is used to make data longer
groceries_df.melt(id_vars=["item"])

In [None]:
# pivot_table can be used to make data wider

# let's save the melted data
groceries_melted = groceries_df.melt(id_vars=["item"])

# use pivot_table to get the data back in the original shape (reset_index() makes item a column instead of an index here)
groceries_melted.pivot_table(index="item", columns="variable", values="value").reset_index()

# Types of data

You might be familiar with types from software engineering - how information is represented and encoded. In data science, it's important to know what kind of data types we have because only some types of data can be used for certain types of analysis. There are 4 main categories of data:

Quantitative data
- **Continuous**, a real number, e.g. temperature, height
- **Discrete**, integer data, e.g. number of points scored, number of people

Qualitative data
- **Ordinal**, categories with an order, e.g. high, medium, low income
- **Nominal**, categories with no order, e.g. name, favorite ice cream
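
One way these four categories can map onto pandas dtypes is sketched below. The survey data is made up, and the choice of categorical dtypes for the qualitative columns is an assumption for illustration, not the only option.

```python
import pandas as pd

# hypothetical survey data covering the four categories
survey = pd.DataFrame({
    "height_cm": [172.5, 160.2, 181.0],                 # continuous
    "siblings": [2, 0, 1],                              # discrete
    "income": ["low", "high", "medium"],                # ordinal
    "favorite_ice_cream": ["mint", "vanilla", "mint"],  # nominal
})

# ordinal: an ordered categorical knows that low < medium < high
survey["income"] = pd.Categorical(
    survey["income"], categories=["low", "medium", "high"], ordered=True
)
# nominal: an unordered categorical
survey["favorite_ice_cream"] = survey["favorite_ice_cream"].astype("category")

print(survey.dtypes)
print(survey["income"].max())    # ordering makes min/max meaningful
print(survey["income"] > "low")  # and allows comparisons
```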


# Exploratory Data Analysis (EDA)

As a data scientist, you might know a lot about programming and statistics and have an area of specialty, but you often are asked to use your skills to solve a problem outside of your domain. One of the key skills you need to develop is the ability to explore a dataset so you can get more context about a particular domain.

## Summarizing the data
- Descriptive statistics
- Plotting

In [None]:
# seaborn is a popular data visualization library built on top of matplotlib
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_context("notebook")
sns.set_theme(style="ticks")

In [None]:
# Load the example dataset for Anscombe's quartet https://en.wikipedia.org/wiki/Anscombe%27s_quartet
df = sns.load_dataset("anscombe")

# let's check out the mean of both variables in each dataset with groupby (more later...)
df.groupby("dataset").mean()

In [None]:
# Show the results of a linear regression within each dataset
sns.lmplot(
    data=df, x="x", y="y", col="dataset", hue="dataset",
    col_wrap=2, palette="muted", ci=None,
    height=4, scatter_kws={"s": 50, "alpha": 1}
)

## Data exploration with penguins dataset

First steps with any dataset
- How much data is there?
- What variables are in the dataset?
- What types are the data?
- Is any data missing?

In [None]:
# load the dataset; as in any programming exercise, choose meaningful variable names!
penguins = sns.load_dataset("penguins")

In [None]:
# how many rows and columns of data do we have? use df.shape attribute
penguins.shape

In [None]:
# df.info() method tells you useful metadata about the data types. what do you notice?
penguins.info()

In [None]:
# if you just want types you can use dtypes
# what types are the different variables?
penguins.dtypes

What's `object`? Let's look at the first data point and find out. Warning: object columns may have mixed types!

In [None]:
penguins.head(1)

Generate descriptive statistics
- count
- mean: average measurement for whole sample
- standard deviation: typical spread of the measurements around the mean
- min/max: lowest and highest values in the sample
- percentiles: cutoff values for a percent of the data when sorted, e.g. the 75th percentile is the value that 75% of the data falls below
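
The statistics listed above can each be computed directly on a Series. This sketch uses a small made-up sample so the numbers are easy to verify by hand; `describe()` reports the same quantities in one call.

```python
import pandas as pd

measurements = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical sample

count = measurements.count()
mean = measurements.mean()
std = measurements.std()            # sample standard deviation (divides by n - 1)
minimum, maximum = measurements.min(), measurements.max()
q75 = measurements.quantile(0.75)   # 75th percentile (linear interpolation by default)

print(count, mean, std, minimum, maximum, q75)
print(measurements.describe())
```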

In [None]:
# df.describe() gives descriptive statistics for numerical variables; categorical data only shows up if you pass include='all'
penguins.describe(include='all')

## Calculate some summary statistics and look at groups
### Group By

Group by will help you answer the vast majority of simple data analysis questions. The basic idea is that you group your data by the values of a variable or set of variables, then calculate a statistic of interest like the mean or minimum.

https://pandas.pydata.org/docs/user_guide/groupby.html
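
The group-then-summarize idea can be sketched on a tiny made-up table before applying it to the penguins. The names `pets_df` and `mass_kg` are hypothetical.

```python
import pandas as pd

# hypothetical measurements for two groups
pets_df = pd.DataFrame({
    "species": ["cat", "cat", "dog", "dog", "dog"],
    "sex": ["F", "M", "F", "M", "M"],
    "mass_kg": [4.0, 5.0, 20.0, 30.0, 40.0],
})

# one statistic per group
mean_by_species = pets_df.groupby("species")["mass_kg"].mean()

# several statistics per group
summary = pets_df.groupby("species")["mass_kg"].agg(["min", "mean", "max"])

# grouping by a set of variables
mean_by_species_and_sex = pets_df.groupby(["species", "sex"])["mass_kg"].mean()

print(mean_by_species)
print(summary)
print(mean_by_species_and_sex)
```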

In [None]:
penguins.species.unique()

In [None]:
penguins.island.unique()

In [None]:
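# mean of each numeric feature for each group
# (numeric_only=True skips text columns such as island and sex, which raise an error in pandas 2.x)
penguins.groupby("species").mean(numeric_only=True)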

In [None]:
# standard deviation of each numeric feature for each group
penguins.groupby("species").std(numeric_only=True)

In [None]:
# how correlated are our numeric variables?
penguins.corr(numeric_only=True).round(2)

## Data Visualization with Seaborn
You should try to make visualizations that will help you understand the data:
- **Histogram** shows how a single variable is distributed across a range
- **Scatter Plot** shows how individual points of data are distributed
- **Box Plot** shows the distribution of a variable's values, often compared across groups
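
A box plot is drawn from a handful of summary numbers, so it can help to compute them once by hand. The sketch below uses made-up values and the common 1.5 × IQR rule for flagging outliers, which matches seaborn's default `whis=1.5`.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])  # hypothetical data with one extreme value

# the box spans the first to third quartile, with a line at the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```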

In [None]:
# histogram
sns.histplot(x="body_mass_g", data=penguins)
plt.title("Body Mass", size=10)

In [None]:
# histogram with categories by species using 'hue' argument
sns.histplot(x="flipper_length_mm", data=penguins, hue="species")
plt.title("Flipper Length", size=20)

In [None]:
# bar plots are useful for comparing counts, sums, or averages; the default estimator is the mean
sns.barplot(x="species", y="flipper_length_mm", data=penguins)
plt.title("Flipper Length for 3 Penguin Species", size=12)

In [None]:
# boxplots show you the distribution of values
sns.boxplot(data=penguins, x="species", y="flipper_length_mm")

In [None]:
# violin plots are like box plots, but using the shape of the distribution instead of a box
sns.violinplot(x="species", y="body_mass_g", data=penguins, hue="sex")
plt.title("Body Mass for 3 Penguin Species by Sex", size=12)

In [None]:
sns.scatterplot(x="flipper_length_mm", y="body_mass_g", hue="species", data=penguins)

In [None]:
sns.scatterplot(x="flipper_length_mm", y="body_mass_g", hue="island", data=penguins)

In [None]:
# make a correlation heatmap, notice that the variable pair for the scatter plot above has a high correlation of 0.87
sns.heatmap(penguins.corr(numeric_only=True), annot=True)

In [None]:
# pairplot looks at pairs of variables, can be useful as a first step
sns.pairplot(penguins, hue="species", height=2)

# Missing Data

Some machine learning and statistical methods cannot handle missing data. Generally you have two choices for handling this:
- Drop missing data
- Impute missing data

Dropping data can be OK if it's only a small proportion of the overall dataset and/or there are other similar rows with complete data. Dropping data that has otherwise useful information can bias your model and analysis.

Imputing data means inserting a substitute value for the missing data. Common methods are using the mean/median for the variable. You can use group by to impute based on a group if you have that data available. More sophisticated methods can use machine learning to impute missing data. Imputation, like dropping data, can result in biased models and analysis if not done carefully.

Imputation can require trial and error, and there is an art to it. It is also imperfect, so you need to think about how it might affect the overall analysis.
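
The options described above can be compared side by side on a small made-up table. This is a sketch, not a recommendation; `birds` and `mass_g` are hypothetical names, and the group-wise version relies on `groupby(...).transform("median")` to line each group's median up with the original rows.

```python
import numpy as np
import pandas as pd

# hypothetical data with two missing masses
birds = pd.DataFrame({
    "species": ["A", "A", "A", "B", "B", "B"],
    "mass_g": [10.0, 12.0, np.nan, 50.0, np.nan, 54.0],
})

# option 1: drop rows with a missing mass
dropped = birds.dropna(subset=["mass_g"])

# option 2: impute the overall median
overall = birds["mass_g"].fillna(birds["mass_g"].median())

# option 3: impute the median within each species
group_median = birds.groupby("species")["mass_g"].transform("median")
by_group = birds["mass_g"].fillna(group_median)

print(len(dropped))
print(overall.tolist())
print(by_group.tolist())
```

In this toy data the overall median sits between the two species, so it fits neither group well; the group-wise median stays close to each species' observed values.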

In [None]:
# do we have missing data?
penguins.isnull().sum()

In [None]:
# let's impute the overall median of each continuous variable using fillna
# (assigning the result back avoids inplace=True on a selected column, which newer pandas versions warn about)
for col in ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]:
    penguins[col] = penguins[col].fillna(penguins[col].median())

In [None]:
# next, let's impute the missing sex information; first, plot body mass by sex to see whether body mass is a sensible basis
sns.histplot(x="body_mass_g", hue="sex", data=penguins)

In [None]:
penguins.groupby("sex").median(numeric_only=True)

In [None]:
# let's look at the rows with missing sex
penguins.loc[penguins["sex"].isnull(),]

In [None]:
# let's use median body mass to impute the missing sex data
penguins.body_mass_g.median()

In [None]:
# we use a boolean index to get the rows with missing sex and a body mass below the median
penguins.loc[penguins["sex"].isnull() & (penguins["body_mass_g"] < penguins.body_mass_g.median())]
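
Boolean indexing combines element-wise conditions with `&`, `|`, and `~`. The sketch below uses made-up data (`pets`, `size_class`, and `cutoff` are hypothetical) to show selecting rows with a combined mask and then assigning to just those rows through `.loc`.

```python
import numpy as np
import pandas as pd

# hypothetical data with missing labels
pets = pd.DataFrame({
    "mass": [3.0, 9.0, 4.0, 8.0],
    "size_class": ["small", np.nan, np.nan, np.nan],
})
cutoff = 5.0  # assumed threshold, for illustration only

missing = pets["size_class"].isnull()
light = pets["mass"] < cutoff

# select rows where both conditions hold
print(pets.loc[missing & light])

# assign only to the selected rows; existing labels are left untouched
pets.loc[missing & light, "size_class"] = "small"
pets.loc[missing & ~light, "size_class"] = "large"

print(pets)
```

When conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because `&` binds more tightly than `<` or `>=`. Note also that `Series.where(cond, other)` keeps values where `cond` is True and replaces the rest, which is the opposite of what the `.loc` assignment above does.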

In [None]:
# this isn't a perfect approach, but since we are doing aggregate analysis, it shouldn't affect the result much
# .loc assigns only to rows with missing sex (Series.where would overwrite every row where the condition is False)
median_mass = penguins["body_mass_g"].median()
missing_sex = penguins["sex"].isnull()
penguins.loc[missing_sex & (penguins["body_mass_g"] < median_mass), "sex"] = "Female"
penguins.loc[missing_sex & (penguins["body_mass_g"] >= median_mass), "sex"] = "Male"

In [None]:
# make sure there's no more missing data!
penguins.isnull().sum()