# Let's Get Some Pandas Reps In!

![more_pandas](https://media.giphy.com/media/KyBX9ektgXWve/giphy.gif)

## What is Pandas?

Pandas will be one of the main tools we will use in data science.  The better you get at Pandas, the easier your life will be when we get to the machine learning algorithms in later phases. 

Pandas is an essential library that comes with Anaconda. As [the Anaconda docs](https://docs.anaconda.com/anaconda/packages/py3.7_osx-64/) tell us, Pandas offers "High-performance, easy-to-use data structures and data analysis tools." It's something like "Excel for Python", but quite a bit more powerful.

Let's navigate to the Pandas website to view some of its benefits: [pandas](https://pandas.pydata.org/about/)

# Importing data and initial data exploration

Let's first import pandas as pd.

In [16]:
import pandas as pd

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


Now read in the heart dataset.

Pandas has many methods for reading different types of files! Note that here we have a .csv file.
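To make "many methods for reading different types of files" concrete, here is a minimal sketch that reads the same two made-up records once as CSV and once as JSON. The in-memory strings stand in for files on disk, and the values are invented for illustration only.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (made-up values).
csv_text = "age,chol\n63,233\n37,250\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# The same records as JSON, read with a different reader function.
json_text = '[{"age": 63, "chol": 233}, {"age": 37, "chol": 250}]'
from_json = pd.read_json(io.StringIO(json_text))

# Both readers hand back a DataFrame with the same columns.
print(list(from_csv.columns))
print(list(from_json.columns))
```

Other readers such as `pd.read_excel()` follow the same pattern: point them at a file, get a DataFrame back.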

Read about this dataset [here](https://www.kaggle.com/ronitf/heart-disease-uci).

Notice the name of the last column!

In [17]:
heart = pd.read_csv('data/heart.csv')

The output of the `pd.read_csv()` function is a pandas *DataFrame*, which has a familiar tabular structure of rows and columns.

In [18]:
heart

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


The .shape attribute of a dataframe shows how many rows and columns are in a dataframe.

In [19]:
heart.shape

(303, 14)

Two main types of pandas objects are the DataFrame and the Series, the latter being in effect a single column (*plus index*) of the former.

But Pandas is built on top of NumPy, and we can always access the NumPy array underlying a DataFrame using `.values`.

In [20]:
heart.values

array([[63.,  1.,  3., ...,  0.,  1.,  1.],
       [37.,  1.,  2., ...,  0.,  2.,  1.],
       [41.,  0.,  1., ...,  0.,  2.,  1.],
       ...,
       [68.,  1.,  0., ...,  2.,  3.,  0.],
       [57.,  1.,  0., ...,  1.,  3.,  0.],
       [57.,  0.,  1., ...,  1.,  2.,  0.]])
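Here is a small sketch of the DataFrame / Series / NumPy relationship on a made-up three-row frame (the values are illustrative, not the full heart data). It also hints at why the array above is printed with decimal points: when integer and float columns are combined into one array, NumPy upcasts everything to a common float type.

```python
import pandas as pd

# Toy frame with one integer column and one float column.
toy = pd.DataFrame({"age": [63, 37, 41], "oldpeak": [2.3, 3.5, 1.4]})

col = toy["age"]              # a single column, index included
print(type(col).__name__)     # Series

arr = toy.to_numpy()          # same data as .values, as a NumPy array
print(arr.shape)              # (3, 2)
print(arr.dtype)              # float64
```

`.to_numpy()` is used here in place of `.values`; for a plain numeric frame like this one the two return the same array.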

What does .head( ) do? What do you learn about the dataset by using it here?

In [21]:
heart.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


What about .tail( )? What about .info( ) and .describe( )?

In [22]:
heart.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


In [23]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [24]:
heart.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


## Individual Features/Columns

We can also inspect columns on their own.

What can we figure out / guess about the different columns?

Let's check the data type of one of our columns:

In [25]:
heart['age'].dtype

dtype('int64')

## Statistics

I can use methods like `.mean()`, `.min()`, `.max()` to calculate quick statistics.

In [26]:
heart['oldpeak'].mean()

1.0396039603960396

In [27]:
heart['thalach'].max()

202

I can also sort the values in a column by using `.sort_values()`

In [28]:
heart['age'].sort_values(ascending=False)

238    77
144    76
129    74
151    71
60     71
       ..
65     35
239    35
125    34
58     34
72     29
Name: age, Length: 303, dtype: int64

# Value Counts

How many different values does slope have? What about sex? And target?

In [29]:
# .value_counts()

heart['slope'].value_counts()

2    142
1    140
0     21
Name: slope, dtype: int64

In [30]:
heart['sex'].value_counts()

1    207
0     96
Name: sex, dtype: int64

In [31]:
heart['target'].value_counts()

1    165
0    138
Name: target, dtype: int64
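Two close relatives of `.value_counts()` are often handy at this point. The sketch below uses a made-up six-value Series (not the real slope column) to show the number of distinct values and the share of each value.

```python
import pandas as pd

# Made-up values standing in for a categorical column.
slope = pd.Series([2, 1, 2, 0, 1, 2], name="slope")

counts = slope.value_counts()                 # rows per distinct value
shares = slope.value_counts(normalize=True)   # same, as proportions
n_distinct = slope.nunique()                  # how many distinct values

print(counts.loc[2])   # 3
print(shares.loc[2])   # 0.5
print(n_distinct)      # 3
```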

# Basic Manipulations

## Adding to a DataFrame

Here are two rows that our engineer accidentally left out of the .csv file, expressed as a Python dictionary:

In [32]:
extra_rows = {'age': [40, 30], 'sex': [1, 0], 'cp': [0, 0], 'trestbps': [120, 130],
              'chol': [240, 200],
             'fbs': [0, 0], 'restecg': [1, 0], 'thalach': [120, 122], 'exang': [0, 1],
              'oldpeak': [0.1, 1.0], 'slope': [1, 1], 'ca': [0, 1], 'thal': [2, 3],
              'target': [0, 0]}
extra_rows

{'age': [40, 30],
 'sex': [1, 0],
 'cp': [0, 0],
 'trestbps': [120, 130],
 'chol': [240, 200],
 'fbs': [0, 0],
 'restecg': [1, 0],
 'thalach': [120, 122],
 'exang': [0, 1],
 'oldpeak': [0.1, 1.0],
 'slope': [1, 1],
 'ca': [0, 1],
 'thal': [2, 3],
 'target': [0, 0]}

How can we add this to the bottom of our dataset?

In [33]:
# Let's first turn this into a DataFrame.
# We can use the .from_dict() class method, called on pd.DataFrame itself.

extras = pd.DataFrame.from_dict(extra_rows)

In [34]:
# Now we just need to concatenate the two DataFrames together.
# Note the `ignore_index` parameter! We'll set that to True.

heart_augmented = pd.concat([heart, extras], ignore_index=True)

In [35]:
# Let's check the end to make sure we were successful!

heart_augmented.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0
303,40,1,0,120,240,0,1,120,0,0.1,1,0,2,0
304,30,0,0,130,200,0,0,122,1,1.0,1,1,3,0
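To see what `ignore_index` actually changes, here is a minimal sketch with two tiny made-up frames. Without it, the appended rows keep their own index labels, so labels can repeat.

```python
import pandas as pd

top = pd.DataFrame({"age": [63, 37], "target": [1, 1]})
extra = pd.DataFrame({"age": [40], "target": [0]})

kept = pd.concat([top, extra])                      # original labels kept
fresh = pd.concat([top, extra], ignore_index=True)  # labels renumbered

print(list(kept.index))    # [0, 1, 0]
print(list(fresh.index))   # [0, 1, 2]
```

A repeated label like the second `0` in `kept` makes label-based lookups return several rows, which is usually not what we want after appending.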


Let's add a new column to our dataset called "test". Set all of its values to 0.

In [36]:
heart['test'] = 0

I can also add columns whose values are functions of existing columns.

How could I add a column, called 'twice_age', that is double the age column?

In [37]:
heart['twice_age'] = 2 * heart['age']

## Filtering

We can use filtering techniques to see only certain rows of our data. If we want to see only the rows for patients 70 years of age or older, we can simply type:

In [38]:
heart[heart['age'] >= 70]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,test,twice_age
25,71,0,1,160,302,0,1,162,0,0.4,2,2,2,1,0,142
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2,1,0,142
129,74,0,1,120,269,0,0,121,1,0.2,2,1,2,1,0,148
144,76,0,2,140,197,0,2,116,0,1.1,1,0,2,1,0,152
145,70,1,1,156,245,0,0,143,0,0.0,2,0,2,1,0,140
151,71,0,0,112,149,0,1,125,0,1.6,1,0,2,1,0,142
225,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0,0,140
234,70,1,0,130,322,0,0,109,0,2.4,1,3,2,0,0,140
238,77,1,0,125,304,0,0,162,1,0.0,2,3,2,0,0,154
240,70,1,2,160,269,0,1,112,1,2.9,1,1,3,0,0,140


Use `&` for "and" and `|` for "or", and wrap each condition in its own parentheses.

In [39]:
# Display the patients who are 70 or over, together with any patients whose
# trestbps score is greater than 170 (meeting either condition is enough).

heart[(heart['age'] >= 70) | (heart['trestbps'] > 170)]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target,test,twice_age
8,52,1,2,172,199,1,1,162,0,0.5,2,0,3,1,0,104
25,71,0,1,160,302,0,1,162,0,0.4,2,2,2,1,0,142
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2,1,0,142
101,59,1,3,178,270,0,0,145,0,4.2,0,0,3,1,0,118
110,64,0,0,180,325,0,1,154,1,0.0,2,0,2,1,0,128
129,74,0,1,120,269,0,0,121,1,0.2,2,1,2,1,0,148
144,76,0,2,140,197,0,2,116,0,1.1,1,0,2,1,0,152
145,70,1,1,156,245,0,0,143,0,0.0,2,0,2,1,0,140
151,71,0,0,112,149,0,1,125,0,1.6,1,0,2,1,0,142
203,68,1,2,180,274,1,0,150,1,1.6,1,0,3,0,0,136
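The filter inside the brackets is just a boolean Series, so it can be built up in pieces. Here is a sketch on a made-up six-row frame showing "or", "and", and "not" (`~`). The parentheses matter because `&` and `|` bind more tightly than comparisons like `>=`.

```python
import pandas as pd

toy = pd.DataFrame({
    "age":      [52, 71, 59, 74, 45, 75],
    "trestbps": [172, 160, 178, 120, 130, 180],
})

older = toy["age"] >= 70          # boolean Series, one value per row
high_bp = toy["trestbps"] > 170

either = toy[older | high_bp]     # at least one condition holds
both = toy[older & high_bp]       # both conditions hold
neither = toy[~(older | high_bp)] # no condition holds

print(len(either), len(both), len(neither))   # 5 1 1
```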


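## .loc[ ] and .iloc[ ]

We can use `.loc[]` to get, say, the first ten values of the age and trestbps columns. Note that `.loc[]` takes square brackets, and that label slices include the endpoint, so `:9` covers rows 0 through 9: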

In [40]:
heart.loc[:9, ['age', 'trestbps']]

Unnamed: 0,age,trestbps
0,63,145
1,37,130
2,41,130
3,56,120
4,57,120
5,57,140
6,56,140
7,44,120
8,52,172
9,57,150


`.iloc[]` is used for selecting locations in the DataFrame **by number**:

In [41]:
heart.iloc[3, 0]

56

In [42]:
# How would we get the same slice as just above by using .iloc[] instead of .loc[]?
# Positional slices exclude the endpoint, so we need :10 here rather than :9.

heart.iloc[:10, [0, 3]]

Unnamed: 0,age,trestbps
0,63,145
1,37,130
2,41,130
3,56,120
4,57,120
5,57,140
6,56,140
7,44,120
8,52,172
9,57,150
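The difference between labels and positions is easiest to see when the index is not simply 0, 1, 2, and so on. Here is a sketch on a made-up frame whose index labels are 10, 20, 30, 40 (chosen purely for illustration).

```python
import pandas as pd

toy = pd.DataFrame(
    {"age": [63, 37, 41, 56], "trestbps": [145, 130, 130, 120]},
    index=[10, 20, 30, 40],
)

# .loc works on labels; the end label (20) is included.
by_label = toy.loc[:20, ["age", "trestbps"]]

# .iloc works on integer positions; the end position (2) is excluded.
by_position = toy.iloc[:2, [0, 1]]

print(by_label.equals(by_position))   # True
```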


# Pair Exercise: 

Here are three datasets from dataportals.org. 

With a partner, take 10 minutes, and choose one of these urls:
        
- Chicago Data Portal, [food inspections](https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5/data)
- Seattle Data Portal, [public employee wages](https://data.seattle.gov/City-Business/City-of-Seattle-Wage-Data/2khk-5ukd)
- San Francisco Data Portal, [mobile food facility](https://data.sfgov.org/Economy-and-Community/Mobile-Food-Facility-Permit/rqzj-sfat)

- Export the csv data onto your local computer, then start exploring the data. Here are some suggestions for how to proceed.

    1. Create a dataframe using pd.read_csv('path_to_your_file/file.csv')
    2. View the head and tail of the DataFrame. 
    3. Call .info() to check the total number of rows/columns, view the datatypes, and see if certain columns have n/a values
    4. Run value_counts on a categorical variable.
    5. Filter the data based on a categorical or continuous variable using the df[df.feature == 'value'] syntax.
    6. Create a new column from an old column
    7. If you have time, create a visualization using matplotlib. 
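If it helps to see the shape of the workflow before starting, steps 1 through 6 might look roughly like this. The data here is an invented stand-in, not one of the portal datasets: the column names `name`, `department`, and `hourly_rate` and all values are hypothetical, and the 2080-hour year is just an assumption for the derived column.

```python
import io
import pandas as pd

# Hypothetical stand-in for an exported file; columns and values are made up.
csv_text = """name,department,hourly_rate
Ann,Parks,30.0
Bo,Police,45.5
Cy,Parks,
Di,Library,28.0
"""
wages = pd.read_csv(io.StringIO(csv_text))       # step 1: build the DataFrame

print(wages.head(2))                             # step 2: first rows
print(wages.tail(2))                             #         and last rows
wages.info()                                     # step 3: shape, dtypes, non-null counts
print(wages["department"].value_counts())        # step 4: counts per category

parks = wages[wages["department"] == "Parks"]    # step 5: filter on a category

# step 6: new column from an old one (assumes 40 hours x 52 weeks)
wages["annual_estimate"] = wages["hourly_rate"] * 2080
```

With a real export, only the `pd.read_csv(...)` line and the column names need to change.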

![pair](https://media.giphy.com/media/FQVZk2elXU14Q/giphy.gif)

For the second half of the lecture, we will use the well-worn Titanic dataset.

In [44]:
# The data is in the csv file called titanic.csv
# create a dataframe object using it, and look at the head to start getting familiar with its structure
import pandas as pd
df = pd.read_csv('data/titanic.csv', index_col='PassengerId')

# Learn to interact and manipulate dataframe columns

Let's take a look at the head of the data frame and the shape, just to get a quick overview.

In [45]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [46]:
df.shape

(891, 11)

### Quick knowledge check
We always want to be aware of what a row represents. 

What does each row in the dataframe represent? 

In [47]:
# Type answer here

As with most things in code, there are several ways to view columns.

The first way is to look at the columns attribute of the dataframe.

In [48]:
# We are getting familiar with dataframe attributes: .shape and now .columns
df.columns

Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket',
       'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [49]:
# We can confirm that the number of columns matches the second element of the shape attribute

len(df.columns) == df.shape[1]

True

A second way to see the columns is using the built-in list() function:

In [50]:
list(df)

['Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

Consider the situation where you want to rename a column in the dataframe. Let's say you are getting tired of remembering that SibSp refers to siblings and spouses. We can rename it like so:

In [51]:
df.rename({'SibSp':'siblings_and_spouses'}, axis=1) # Axis tells the rename method to look for SibSp along the columns axis

PassengerId,Survived,Pclass,Name,Sex,Age,siblings_and_spouses,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...
887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Great. Now print out the head of the df

In [52]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Looks like something did not register.  The column name is back to SibSp. 
A finicky thing about Pandas is the use of inplace.  By default, rename returns a new DataFrame and leaves the original untouched.  
In order for the object itself to be changed, we need to set the inplace parameter to True (or assign the result back to df).

In [53]:
df.rename({'SibSp':'siblings_and_spouses'}, axis=1, inplace=True)

In [54]:
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,siblings_and_spouses,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
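The "returns a new object" behavior is easy to check on a made-up two-column frame. In this sketch the result of `rename` is stored under a new name so both versions can be compared.

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2]})

# Without inplace=True, rename hands back a new DataFrame.
renamed = toy.rename(columns={"SibSp": "siblings_and_spouses"})

print(list(toy.columns))       # ['SibSp', 'Parch']
print(list(renamed.columns))   # ['siblings_and_spouses', 'Parch']
```

Assigning the result back (`toy = toy.rename(...)`) has the same end effect as `inplace=True`, and many people find it easier to follow when reading code later.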


We can also change multiple columns at once with a dictionary:

In [55]:
df.rename(columns = {'Parch': 'parent_child_ratio', 'Pclass': 'ticket_class'}, inplace=True)

In [56]:
df.head()

PassengerId,Survived,ticket_class,Name,Sex,Age,siblings_and_spouses,parent_child_ratio,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


We can also interact directly with the .columns attribute


In [57]:
df_columns = df.columns # saved for pair programming

df.columns = list('ABCDEFGHIJK')
# What will the columns of our dataframe look like now?

If we find a column is not useful, we can drop columns with the drop method.



In [58]:
df.drop('A', axis=1)

PassengerId,B,C,D,E,F,G,H,I,J,K
1,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
5,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
887,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
888,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
889,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
890,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C
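A couple of variations on `drop`, sketched on a made-up three-column frame: the `columns=` keyword can replace `axis=1`, and a list drops several columns at once. As with `rename`, the original is left alone unless we reassign or pass `inplace=True`.

```python
import pandas as pd

toy = pd.DataFrame({"A": [0, 1], "B": [3, 1], "C": ["x", "y"]})

without_a = toy.drop(columns="A")           # same effect as toy.drop('A', axis=1)
without_ab = toy.drop(columns=["A", "B"])   # several columns at once

print(list(without_a.columns))    # ['B', 'C']
print(list(without_ab.columns))   # ['C']
print(list(toy.columns))          # ['A', 'B', 'C']
```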


# Pair Program:

Take 5 minutes with a partner to perform this activity.

We just renamed our columns to a useless series of letters. Luckily we saved our column names in the variable df_columns. Let's rename our columns using the columns attribute.  To make things neater, we want the column names to all be lowercase.   You can do this in any way you prefer, but a list comprehension can do it in one line.

Remember, list comprehensions look like this:
> [function(variable) for variable in iterable]

In [61]:
# your answer here
df.columns = [col.lower() for col in df_columns]
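For comparison, here is the list comprehension next to the vectorized string accessor that pandas offers on an Index. The frame and the saved names below are a made-up three-column stand-in for `df` and `df_columns`.

```python
import pandas as pd

saved = pd.Index(["Survived", "Pclass", "Name"])        # stand-in for df_columns
toy = pd.DataFrame([[0, 3, "Braund"]], columns=list("ABC"))

# Option 1: list comprehension, as in the exercise.
toy.columns = [col.lower() for col in saved]

# Option 2: the .str accessor applies lower() to every label at once.
alt = saved.str.lower()

print(list(toy.columns) == list(alt))   # True
```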

## Identify and deal with N/A values

NA (not available) values are a constant annoyance.  They can mess up our code and our analysis.  One of the first steps of EDA you will perform is checking whether your data has NAs.  

Apropos of the event it describes, the Titanic dataset has many NA values. 

We can see that in a few ways, first using info.

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   survived              891 non-null    int64  
 1   ticket_class          891 non-null    int64  
 2   name                  891 non-null    object 
 3   sex                   891 non-null    object 
 4   age                   714 non-null    float64
 5   siblings_and_spouses  891 non-null    int64  
 6   parent_child_ratio    891 non-null    int64  
 7   ticket                891 non-null    object 
 8   fare                  891 non-null    float64
 9   cabin                 204 non-null    object 
 10  embarked              889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


## Knowledge Check: From the above info() output, which columns have na's? How can you tell?


Your answer here  


Another way to see na's is with the **isna()** method

In [63]:
df.isna()

PassengerId,survived,ticket_class,name,sex,age,siblings_and_spouses,parent_child_ratio,ticket,fare,cabin,embarked
1,False,False,False,False,False,False,False,False,False,True,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,True,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...
887,False,False,False,False,False,False,False,False,False,True,False
888,False,False,False,False,False,False,False,False,False,False,False
889,False,False,False,False,True,False,False,False,False,True,False
890,False,False,False,False,False,False,False,False,False,False,False


More usefully, we can sum the values which are na:

In [64]:
df.isna().sum()

survived                  0
ticket_class              0
name                      0
sex                       0
age                     177
siblings_and_spouses      0
parent_child_ratio        0
ticket                    0
fare                      0
cabin                   687
embarked                  2
dtype: int64
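Summing works because `True` counts as 1. By the same logic, taking the mean gives the share of missing values per column, which is often easier to compare across columns. Here is a sketch on a made-up four-row frame.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age":   [22.0, np.nan, 26.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare":  [7.25, 71.28, 7.93, 8.05],
})

missing_counts = toy.isna().sum()    # number of NAs per column
missing_share = toy.isna().mean()    # fraction of NAs per column

print(missing_counts.to_dict())   # {'age': 2, 'cabin': 3, 'fare': 0}
print(missing_share.to_dict())    # {'age': 0.5, 'cabin': 0.75, 'fare': 0.0}
```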

## Dealing with na's


One way to deal with na's is by dropping rows that have them:


In [65]:
df.dropna()

PassengerId,survived,ticket_class,name,sex,age,siblings_and_spouses,parent_child_ratio,ticket,fare,cabin,embarked
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7000,G6,S
12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.5500,C103,S
...,...,...,...,...,...,...,...,...,...,...,...
872,1,1,"Beckwith, Mrs. Richard Leonard (Sallie Monypeny)",female,47.0,1,1,11751,52.5542,D35,S
873,0,1,"Carlsson, Mr. Frans Olof",male,33.0,0,0,695,5.0000,B51 B53 B55,S
880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C
888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S


Let's explore what happened there. Since we didn't include inplace=True, df itself was not changed. We can chain .info() onto the same call to summarize what dropna returned:

In [66]:
df.dropna().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 2 to 890
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   survived              183 non-null    int64  
 1   ticket_class          183 non-null    int64  
 2   name                  183 non-null    object 
 3   sex                   183 non-null    object 
 4   age                   183 non-null    float64
 5   siblings_and_spouses  183 non-null    int64  
 6   parent_child_ratio    183 non-null    int64  
 7   ticket                183 non-null    object 
 8   fare                  183 non-null    float64
 9   cabin                 183 non-null    object 
 10  embarked              183 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 17.2+ KB


# Knowledge check
How did dropna affect the dataframe?  Why did it remove so many rows?

In [68]:
# your answer here

Calling dropna without params cut our data from 891 rows to 183, which is a very bad thing. Our model performance, when we get to modeling, will rely heavily on having enough data.

Let's add a parameter to dropna:

In [69]:
list(df)

['survived',
 'ticket_class',
 'name',
 'sex',
 'age',
 'siblings_and_spouses',
 'parent_child_ratio',
 'ticket',
 'fare',
 'cabin',
 'embarked']

In [70]:
df.dropna(subset=['embarked'], inplace=True)

In [71]:
# Now there are only two columns with na values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 1 to 891
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   survived              889 non-null    int64  
 1   ticket_class          889 non-null    int64  
 2   name                  889 non-null    object 
 3   sex                   889 non-null    object 
 4   age                   712 non-null    float64
 5   siblings_and_spouses  889 non-null    int64  
 6   parent_child_ratio    889 non-null    int64  
 7   ticket                889 non-null    object 
 8   fare                  889 non-null    float64
 9   cabin                 202 non-null    object 
 10  embarked              889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.3+ KB
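The effect of the different `dropna` parameters can be compared side by side on a made-up three-row frame in which every row is missing something. `subset` is the parameter used above; `thresh` is shown as one further option.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age":      [22.0, np.nan, 26.0],
    "cabin":    [None, "C85", None],
    "embarked": ["S", "C", None],
})

any_na = toy.dropna()                          # drop rows with an NA anywhere
by_subset = toy.dropna(subset=["embarked"])    # only look at 'embarked'
by_thresh = toy.dropna(thresh=2)               # keep rows with >= 2 non-NA values

print(len(any_na), len(by_subset), len(by_thresh))   # 0 2 2
```

The bare call throws away every row here, while the targeted versions keep most of the data.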


You will find that data preprocessing presents you with many paths to follow.  You have many choices you can make as to how to preprocess. 

For now let's make the choice to drop cabin, since it has so many nulls:

In [72]:
df.drop('cabin', axis=1, inplace=True)

With age, let's be a bit more creative, and impute the mean. This is a common method.

## Short Exercise: Turn off your camera and take 3 minutes:

Using the fillna() method, write code below to fill the na's in age with the mean of age.

In [None]:
# Your code here

In [None]:
# Run df.info() to check that you have no more na's.
df.info()
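If you get stuck, here is the same idea on a different, made-up column (a hypothetical `height` column, not the Titanic data), so the pattern is visible without giving away the exact answer.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"height": [150.0, np.nan, 170.0, np.nan]})

# .mean() skips NA values by default, so this is the mean of 150 and 170.
mean_height = toy["height"].mean()

# fillna returns a new Series, so we assign it back to the column.
toy["height"] = toy["height"].fillna(mean_height)

print(toy["height"].tolist())   # [150.0, 160.0, 170.0, 160.0]
```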