# Let's Use Pandas!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

![pandas](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/2880px-Pandas_logo.svg.png)

## Agenda

Students will be able to (SWBAT):

- use `pandas` to read in .csv files;
- interact with and manipulate Series and DataFrames;
- identify and deal with N/A values;
- visualize data using built-in DataFrame methods and `matplotlib`.

## What is Pandas?

Pandas, as [the Anaconda docs](https://docs.anaconda.com/anaconda/packages/py3.7_osx-64/) tell us, offers us "High-performance, easy-to-use data structures and data analysis tools." It's something like "Excel for Python", but it's quite a bit more powerful.

Let's read in the heart dataset.

Pandas has many methods for reading different types of files. Note that here we have a .csv file.

Read about this dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci).

In [None]:
heart_df = pd.read_csv('data/heart.csv')

The output of the `pd.read_csv()` function is a pandas *DataFrame*, which has a familiar tabular structure of rows and columns.
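As a minimal, self-contained sketch of what `pd.read_csv()` returns, the cell below parses a tiny CSV from an in-memory string rather than from a file. The column names and values here are made up for illustration and are not read from `data/heart.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text, invented for this illustration.
csv_text = "age,sex,chol\n63,1,233\n37,1,250\n"

# read_csv accepts a file path or any file-like object.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(type(toy_df))
print(toy_df.shape)  # (number of rows, number of columns)
```

The first line of the text is treated as the header row, so the toy frame ends up with two rows and three named columns.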

In [None]:
type(heart_df)

In [None]:
heart_df

Notice the name of the last column!

## DataFrames and Series

Two main types of pandas objects are the DataFrame and the Series, the latter being in effect a single column of the former:

In [None]:
age_series = heart_df['age']
type(age_series)

Notice how we can isolate a column of our DataFrame simply by using square brackets together with the name of the column!

Both Series and DataFrames have an *index* as well:

In [None]:
heart_df.index

In [None]:
age_series.index

Pandas is built on top of NumPy, and we can always access the NumPy array underlying a DataFrame using `.values`.

In [None]:
heart_df.values
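The pandas documentation recommends `.to_numpy()` over `.values` in new code. The sketch below uses a made-up two-column frame (not the heart data) to show that when columns have different dtypes, the resulting array is converted to one common dtype.

```python
import pandas as pd

# Toy frame with one integer column and one float column (illustrative values).
toy_df = pd.DataFrame({'age': [63, 37], 'oldpeak': [2.3, 3.5]})

arr = toy_df.to_numpy()

print(arr.shape)  # (2, 2)
print(arr.dtype)  # float64: the integers were upcast to match the floats
```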

## Basic DataFrame Attributes and Methods

### `.head()`

In [None]:
heart_df.head()

### `.tail()`

In [None]:
heart_df.tail()

### `.info()`

In [None]:
heart_df.info()

### `.describe()`

In [None]:
heart_df.describe()

### `.dtypes`

In [None]:
heart_df.dtypes

### `.shape`

In [None]:
heart_df.shape

## Adding to a DataFrame


### Adding Rows

Here are two rows that our engineer accidentally left out of the .csv file, expressed as a Python dictionary:

In [None]:
extra_rows = {'age': [40, 30], 'sex': [1, 0], 'cp': [0, 0], 'trestbps': [120, 130],
              'chol': [240, 200],
             'fbs': [0, 0], 'restecg': [1, 0], 'thalach': [120, 122], 'exang': [0, 1],
              'oldpeak': [0.1, 1.0], 'slope': [1, 1], 'ca': [0, 1], 'thal': [2, 3],
              'target': [0, 0]}
extra_rows

How can we add this to the bottom of our dataset?

In [None]:
# Let's first turn this into a DataFrame.
# We can use the .from_dict() class method, called on the DataFrame class itself.

missing = pd.DataFrame.from_dict(extra_rows)
missing

In [None]:
# Now we just need to concatenate the two DataFrames together.
# Note the `ignore_index` parameter! We'll set that to True.

heart_augmented = pd.concat([heart_df, missing],
                           ignore_index=True)

In [None]:
# Let's check the end to make sure we were successful!

heart_augmented.tail()
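To see what `ignore_index` changes, here is a small sketch on two made-up frames (the names `top` and `bottom` are just for illustration). Without the parameter, each frame keeps its own row labels, so labels can repeat.

```python
import pandas as pd

top = pd.DataFrame({'age': [63, 37]})
bottom = pd.DataFrame({'age': [40]})

kept = pd.concat([top, bottom])                            # original labels kept
renumbered = pd.concat([top, bottom], ignore_index=True)   # fresh 0..n-1 labels

print(kept.index.tolist())        # [0, 1, 0]
print(renumbered.index.tolist())  # [0, 1, 2]
```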

### Adding Columns

Adding a column is very easy in `pandas`. Let's add a new column to our dataset called "test", and set all of its values to 0.

In [None]:
heart_augmented['test'] = 0

I can also add columns whose values are functions of existing columns.

Suppose I want to multiply the cholesterol column ("chol") by the max heart rate column ("thalach"):

In [None]:
# The column name reflects that this is a product, not a sum.
heart_augmented['chol_x_thalach'] = heart_augmented['chol'] * heart_augmented['thalach']

## Filtering

We can use filtering techniques to see only certain rows of our data. If we want to see only the rows for patients 70 years of age or older, we can simply type:

In [None]:
heart_augmented[heart_augmented['age'] >= 70]

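Use `&` for "and" and `|` for "or", and wrap each condition in parentheses, since `&` and `|` bind more tightly than comparison operators such as `>=`.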
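Here is a rough sketch of combining two conditions, using a small made-up frame rather than the heart data. Each comparison produces a Boolean Series, and the combined Series is what selects the rows.

```python
import pandas as pd

# Illustrative values only.
toy_df = pd.DataFrame({'age': [72, 55, 68, 71],
                       'trestbps': [120, 180, 150, 172]})

older = toy_df['age'] >= 70
high_bp = toy_df['trestbps'] > 170

both = toy_df[older & high_bp]    # rows meeting both conditions
either = toy_df[older | high_bp]  # rows meeting at least one condition

print(len(both), len(either))  # 1 3
```

Naming the masks first, as here, avoids the parentheses problem altogether; writing the comparisons inline requires parentheses around each one.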

In [None]:
# Display the patients who are 70 or over as well as the patients whose
# trestbps score is greater than 170.



### `.loc` and `.iloc`

We can use `.loc` to get, say, the first ten values of the age and resting blood pressure ("trestbps") columns:

In [None]:
heart_augmented.loc

In [None]:
heart_augmented.loc[:9, ['age', 'trestbps']]

`.iloc` is used for selecting locations in the DataFrame **by number**:

In [None]:
heart_augmented.iloc

In [None]:
heart_augmented.iloc[3, 0]
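The difference between labels and positions is easier to see when the index is not the default 0, 1, 2, and so on. The sketch below uses a made-up frame with string row labels (`'p1'` to `'p4'` are hypothetical). Note that a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.

```python
import pandas as pd

toy_df = pd.DataFrame({'age': [63, 37, 41, 56],
                       'trestbps': [145, 130, 130, 120]},
                      index=['p1', 'p2', 'p3', 'p4'])

by_label = toy_df.loc['p1':'p3', ['age']]  # 'p3' is included
by_position = toy_df.iloc[0:3, [0]]        # position 3 is excluded

print(by_label.equals(by_position))
print(toy_df.iloc[3, 0])  # 56: fourth row, first column
```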

In [None]:
# How would we get the same slice as just above by using .iloc instead of .loc?



## Statistics

### `.mean()`

In [None]:
heart_augmented.mean()

Be careful! Some of these are not straightforwardly interpretable. What does an average "sex" of 0.682 mean?
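One way to read such a number, assuming the column is coded only as 0 and 1: the mean is the proportion of rows coded 1. A tiny sketch with made-up values:

```python
import pandas as pd

# Hypothetical 0/1-coded column.
sex = pd.Series([1, 1, 0, 1])

print(sex.mean())                   # 0.75
print((sex == 1).sum() / len(sex))  # 0.75, the share of rows equal to 1
```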

### `.min()`

In [None]:
heart_augmented.min()

### `.max()`

In [None]:
heart_augmented.max()

## Series Methods

### `.value_counts()`

How many different values does slope have? What about sex? And target?

In [None]:
heart_augmented['slope'].value_counts()
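As a small sketch on made-up values, `.value_counts()` tallies how often each value occurs, `.nunique()` gives just the number of distinct values, and `normalize=True` turns the counts into proportions.

```python
import pandas as pd

# Illustrative values only, not taken from the heart data.
slope = pd.Series([0, 1, 1, 2, 1, 2])

counts = slope.value_counts()                # most frequent value first
shares = slope.value_counts(normalize=True)  # proportions instead of counts

print(counts)
print(slope.nunique())  # 3
```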

### `.sort_values()`

In [None]:
heart_augmented['age'].sort_values()

## The Titanic Dataset

In [None]:
titanic = pd.read_csv('data/titanic.csv')

In [None]:
titanic.columns

In [None]:
titanic.shape

In [None]:
titanic.sample(3)

### Renaming Columns

In [None]:
titanic.rename({'SibSp':'siblings_and_spouses'}, axis=1).head(2)
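Note that `.rename()` returns a new DataFrame and leaves the original untouched unless we assign the result back. A minimal sketch on a made-up frame, using the equivalent `columns=` keyword in place of `axis=1`:

```python
import pandas as pd

toy_df = pd.DataFrame({'SibSp': [1, 0], 'Age': [22.0, 38.0]})

renamed = toy_df.rename(columns={'SibSp': 'siblings_and_spouses'})

print(list(toy_df.columns))   # ['SibSp', 'Age']
print(list(renamed.columns))  # ['siblings_and_spouses', 'Age']
```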

## Dealing with NAs / NaNs

Values can be missing for lots of reasons, so we'll have lots of occasions to deal with this issue. And we'll need to deal with it: In general we can't have null values if we're going to use our data for building and testing models.

There are several ways we might go about it. The simplest strategy is just to drop the rows or columns that contain the nulls.

In [None]:
titanic.isnull().sum()

There are lots of nulls in "Cabin", so we might just drop that column altogether. By contrast, there are only a couple of nulls in "Embarked", so in that case we might keep the column but just drop the rows that have nulls in that column.
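The two dropping strategies just described might look like this on a made-up three-row frame (the values are invented, and `no_cabin` and `cleaned` are illustrative names). The `subset` argument restricts `.dropna()` to nulls in the listed columns.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                       'Cabin': [np.nan, 'C85', np.nan],
                       'Embarked': ['S', 'C', np.nan]})

# Drop the mostly-empty column entirely.
no_cabin = toy_df.drop(columns='Cabin')

# Drop only the rows where "Embarked" is null.
cleaned = no_cabin.dropna(subset=['Embarked'])

print(cleaned.shape)           # (2, 2)
print(cleaned.isnull().sum())  # "Age" still has one null
```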

What about "Age"?

Another strategy is to keep the offending cells but somehow fill them in artificially. This is obviously a bit risky, since we are in effect just making up data, but sometimes it makes sense to fill in null values with the mean or the median of the relevant column. This sort of "filling-in" strategy is called **imputation**.
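A minimal sketch of median imputation on a made-up Series: `.median()` skips nulls by default, and `.fillna()` returns a new Series with the nulls replaced.

```python
import numpy as np
import pandas as pd

# Illustrative values with one missing entry.
age = pd.Series([20.0, np.nan, 30.0, 40.0])

median_age = age.median()        # 30.0, computed from the non-null values
filled = age.fillna(median_age)

print(filled.tolist())  # [20.0, 30.0, 30.0, 40.0]
```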

Let's drop the "Cabin" column, fill in the "Age" nulls with the median of that column, and then drop the few remaining rows that still contain nulls:

In [None]:
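# Drop the "Cabin" column, which is mostly nulls.
no_cabin = titanic.drop('Cabin', axis=1)
no_cabin.shape

In [None]:
# Impute the missing ages with the median of the non-null ages.
no_cabin['Age'] = no_cabin['Age'].fillna(np.nanmedian(no_cabin['Age']))

In [None]:
# Drop any rows that still contain nulls (here, the ones in "Embarked").
no_nulls = no_cabin.dropna()
no_nulls.info()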

## Plotting

In [None]:
no_nulls.plot('Age', 'Fare', kind='scatter');

## Let's find a .csv file online and experiment with it.

I'm going to head to [dataportals.org](https://dataportals.org) to find a .csv file.