# Lesson 1.3: Intro to Pandas

### Lesson Duration: 3 hours

> Purpose: This lesson introduces the different types of data that analysts come across. Students will learn to perform in Python the same kinds of data wrangling operations they implemented in MS Excel, including reading data, blending data, working with columns, filtering, and subsetting.

---

### Learning Objectives

- Merge or blend data from multiple sources using Pandas in Python
- Work with NumPy and Pandas, two essential Python libraries for data analysis.
- Perform basic data pre-processing operations on dataframes.

> :exclamation: Important note: Emphasize that sharp business acumen is a very important skill for data analysts.
> An analyst should know the business function they are supporting, ask the right questions of the clients/stakeholders involved, and understand the factors that drive the business. It is also critical to understand the limitations of the analysis given the available data: do we have the right data to answer the questions we are trying to solve, or do we need more data, and if so, on which parameters?

### Lesson 1 key concepts

> :clock10: 20 min

Features and labels, data types in data analysis

- Numerical data
- Categorical data (nominal, ordinal)
- Text data for NLP
- Media data - image, video, voice/speech
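
As a minimal sketch of how the first two data types in the list above can look in pandas, the example below builds a small toy table. The column names and values (`income`, `state`, `donor_tier`) are hypothetical and only serve as an illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data, not taken from the lesson files
df = pd.DataFrame({
    "income": [392, 365, 301],               # numerical
    "state": ["FL", "IL", "FL"],             # categorical, nominal (no natural order)
    "donor_tier": ["low", "high", "medium"], # categorical, ordinal (has an order)
})

# Nominal category: values are labels without any ranking
df["state"] = df["state"].astype("category")

# Ordinal category: declare the order explicitly
df["donor_tier"] = pd.Categorical(
    df["donor_tier"], categories=["low", "medium", "high"], ordered=True
)

print(df.dtypes)
print(is_numeric_dtype(df["income"]))  # checks whether the column is numerical
print(df["donor_tier"].min())          # ordered categories support min/max
```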

<details>
  <summary> Engagement strategies for using pandas library, reading files and data blending </summary>

:exclamation: This might be the first time the students will use a Python library.

> Give an introduction to the library, aliasing, keywords, and the methods available.
> Open the files using a text editor and show the students how the files are stored based on the file extensions, e.g. the difference between comma-separated (`CSV`), tab-separated, and pipe-separated values.
> We are using 4 files in this lesson. Show the students how to use `concat` with 2 files (as you will see in the following code sample, you can use `file1.csv` and `file2.txt`), and ask them to read and `concat` the other two files themselves as hands-on practice.
> At this stage it is important to move slowly, as the students are fairly new to using Python libraries. Emphasize the importance of reading the documentation for the functions they use. At different stages, also encourage the students to play around with the code and analyze the changes in the output.

</details>
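
The difference between comma-, tab-, and pipe-separated files can be sketched without touching the lesson files. The example below is a minimal illustration that reads three small in-memory strings (made-up contents, assumed for the demo) with `pd.read_csv` and the `sep` argument.

```python
import io
import pandas as pd

# Made-up file contents; only the delimiter differs between them
csv_text = "CONTROLN,STATE,GENDER\n44060,FL,M\n96093,IL,M\n"
tsv_text = "CONTROLN\tSTATE\tGENDER\n43333\tFL\tF\n"
pipe_text = "CONTROLN|STATE|GENDER\n21885|NC|M\n"

# io.StringIO lets read_csv treat a string like an open file
from_csv = pd.read_csv(io.StringIO(csv_text))             # default separator is ","
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")   # tab-separated values
from_pipe = pd.read_csv(io.StringIO(pipe_text), sep="|")  # pipe-separated values

print(from_csv.shape, from_tsv.shape, from_pipe.shape)
```

If the wrong separator is used, pandas typically reads each line into a single column, which is a quick thing to check with `.shape` or `.columns`.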

- Reading data into Jupyter and data cleaning operations in Python using pandas and NumPy:
  - Introduction to pandas
  - Reading flat files and data blending
  - Standardizing, renaming headers

In [2]:
import pandas as pd  # point out the syntax highlighting: keywords appear in green, strings in red, etc.
import numpy as np

# Reading data
file1 = pd.read_csv('./files_for_lesson_and_activities/file1.csv')
file1.head()
file2 = pd.read_csv('./files_for_lesson_and_activities/file2.txt', sep='\t')  # Demonstrate reading and concatenating these 2 files; the students will do the other two

In [4]:
# Data blending
column_names = file1.columns
data = pd.DataFrame(columns=column_names)
data = pd.concat([data,file1, file2], axis=0)
data.shape

![Diagram for concatenation](https://education-team-2020.s3-eu-west-1.amazonaws.com/data-analytics/1.3+-+Axes+Explain+-+Data.jpg)
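
To make the `axis` argument concrete, here is a small sketch with two toy DataFrames (hypothetical values, not the lesson data). It contrasts stacking rows (`axis=0`) with placing columns side by side (`axis=1`).

```python
import pandas as pd

# Two toy frames with the same columns
a = pd.DataFrame({"id": [1, 2], "state": ["FL", "IL"]})
b = pd.DataFrame({"id": [3], "state": ["NC"]})

# axis=0: rows of b are appended below the rows of a
stacked = pd.concat([a, b], axis=0)

# axis=1: columns of b are placed next to a, aligned on the index
side_by_side = pd.concat([a, b], axis=1)

# With axis=0 the original index labels are kept, so labels can repeat;
# ignore_index=True builds a fresh 0..n-1 index instead
renumbered = pd.concat([a, b], axis=0, ignore_index=True)

print(stacked.shape, side_by_side.shape)
print(list(stacked.index), list(renumbered.index))
```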

In [6]:
data.columns

Index(['CONTROLN', 'STATE', 'GENDER', 'HV1', 'IC1', 'IC4', 'HVP1', 'IC5',
       'POBC1', 'POBC2', 'IC2', 'IC3', 'AVGGIFT', 'TCODE', 'DOB', 'DOMAIN',
       'TARGET_D'],
      dtype='object')

In [7]:
# Standardizing header names
cols = []
for i in range(len(data.columns)):
    cols.append(data.columns[i].lower())
data.columns = cols

# renaming columns
data = data.rename(columns={ 'controln':'id','hv1':'median_home_val', 'ic1':'median_household_income'})

In [8]:
# Standardizing header names (alternative: loop over the column names directly instead of over their positions)
cols = []
for c in data.columns:
    cols.append(c.lower())
data.columns = cols

# renaming columns
data = data.rename(columns={ 'controln':'id','hv1':'median_home_val', 'ic1':'median_household_income'})
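
The loops above can also be written more compactly. The sketch below shows two common alternatives on a toy DataFrame with assumed column names: the vectorized `.str.lower()` accessor on the column index, and a list comprehension.

```python
import pandas as pd

# Empty toy frame; only the headers matter here
df = pd.DataFrame(columns=["CONTROLN", "STATE", "HV1"])

# Option 1: vectorized string method on the column Index
df.columns = df.columns.str.lower()
after_str_lower = list(df.columns)

# Option 2: list comprehension, equivalent to the explicit for loop
df.columns = [c.upper() for c in df.columns]  # back to uppercase for the demo
df.columns = [c.lower() for c in df.columns]
after_comprehension = list(df.columns)

print(after_str_lower, after_comprehension)
```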

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 15 min (+ 5 min Review)

# 1.03 Activity 1

Refer to the folder `files_for_activities` for this exercise. **You can continue using the same Jupyter notebook that you created in class (where you worked on file1 and file2). Please save your work (Jupyter notebook) as we will build on the same activities later.**

1. Load the data (`file3.xlsx` and `file4.xlsx`) into your Jupyter notebook. You might get an error saying that the optional dependency _openpyxl_ (or _xlrd_ on older pandas versions) is missing. In that case, install it using `pip`. If you don't get the error, move to the next step. :smile:
2. Print data columns for both files.
3. Check the names and order of columns in the files, and compare them with the "data" DataFrame created in class.
4. Rename the required columns in the newly read dataframes before concatenating them with `data`.
5. Change data columns from uppercase to lowercase.


_We will merge the dataframes in the next exercise_

In [22]:
# some students might get an error saying that the optional dependency openpyxl is missing
# in that case, run the following line in a notebook cell:
# %pip install openpyxl

In [9]:
import pandas as pd

file3 = pd.read_excel("./files_for_lesson_and_activities/file3.xlsx", engine="openpyxl")
file4 = pd.read_excel('./files_for_lesson_and_activities/file4.xlsx', engine="openpyxl")

In [17]:
file3.columns
file4.columns

Index(['CONTROLN', 'STATE', 'GENDER', 'HV1', 'IC1', 'IC4', 'HVP1', 'IC5',
       'POBC1', 'POBC2', 'IC2', 'IC3', 'AVGGIFT', 'TCODE', 'DOB', 'DOMAIN',
       'TARGET_D'],
      dtype='object')

In [11]:
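# Caution: rename() ignores keys it cannot find, so while the headers are still uppercase these lowercase keys change nothing;
# run this cell after the lowercasing cells below for the rename to take effect.
file3 = file3.rename(columns={ 'controln':'id','hv1':'median_home_val', 'ic1':'median_household_income'})
file4 = file4.rename(columns={ 'controln':'id','hv1':'median_home_val', 'ic1':'median_household_income'})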

In [18]:
cols1=[]
for c in file3.columns:
    cols1.append(c.lower())

file3.columns=cols1

In [19]:
# Similarly for the other dataframe:
cols2=[]
for c in file4.columns:
    cols2.append(c.lower())

file4.columns=cols2
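
The order of the two steps (lowercasing and renaming) matters, because `rename` matches keys exactly and silently skips keys that are not present. The following is a minimal sketch on a toy frame with assumed values; it is not the lesson's notebook output.

```python
import pandas as pd

# Toy frame with uppercase headers, as in the raw files
raw = pd.DataFrame({"CONTROLN": [1], "HV1": [537], "IC1": [365]})

mapping = {
    "controln": "id",
    "hv1": "median_home_val",
    "ic1": "median_household_income",
}

# Renaming first: the lowercase keys do not match the uppercase headers,
# so nothing changes and no error is raised
no_effect = raw.rename(columns=mapping)

# Lowercasing first, then renaming: the keys now match
renamed = raw.rename(columns=str.lower).rename(columns=mapping)

print(list(no_effect.columns))
print(list(renamed.columns))
```

Printing `.columns` after each step is a cheap way to confirm that a rename actually took effect before concatenating.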

### Lesson 2 key concepts

> :clock10: 20 min

Data wrangling/cleaning using Python:

- Deleting columns
- Rearranging columns
- Filtering and subsetting

In [21]:
# deleting columns
data = data.drop(['tcode'], axis=1) # Explain the argument axis, when axis is 0 and 1
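
A small sketch of what the `axis` argument does in `drop`, using a toy DataFrame with made-up values: `axis=1` removes columns by name, `axis=0` removes rows by index label.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "tcode": [0, 1, 2]})

# axis=1: drop a column by name (same as df.drop(columns=["tcode"]))
without_col = df.drop(["tcode"], axis=1)

# axis=0: drop a row by index label (same as df.drop(index=[0]))
without_row = df.drop([0], axis=0)

print(list(without_col.columns))
print(list(without_row.index))
```

Neither call modifies `df` itself; the result has to be assigned, as the lesson code does with `data = data.drop(...)`.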

In [22]:
# Rearranging columns
data = data[['id', 'state', 'gender', 'median_home_val', 'median_household_income', 'ic2', 'ic3', 'ic4', 'ic5', 'avggift', 'domain', 'dob', 'target_d']]

In [23]:
data

Unnamed: 0,id,state,gender,median_home_val,median_household_income,ic2,ic3,ic4,ic5,avggift,domain,dob,target_d
0,44060,FL,M,AAA896,392,430.0,466,520.0,21975,28.000000,C2,1901,100.0
1,96093,IL,M,537.00,365,415.0,410,473.0,19387,5.666667,T2,0,7.0
2,43333,FL,F,725.00,301,340.0,361,436.0,18837,4.111111,C2,2501,5.0
3,21885,NC,M,AAA1095,401,407.0,399,413.0,14014,27.277778,T2,2208,38.0
4,190108,FL,F,995.00,252,280.0,316,348.0,17991,6.000000,C2,0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1010,161838,CA,F,1953,304,353.0,337,380.0,13811,13.500000,C2,4212,14.0
1011,161838,CA,F,1953,304,353.0,337,380.0,13811,13.500000,C2,4212,14.0
1012,138311,AZ,Female,1708,437,586.0,551,684.0,29098,9.769231,S1,1403,20.0
1013,123469,TX,M,561,493,529.0,506,540.0,16623,5.200000,T2,0,5.0


In [30]:
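# filtering and subsetting -- using conditions with DataFrame
data[data['gender']=='M']                              # rows where gender is exactly 'M'
data[data['gender'].isin(['M', 'F'])]                  # rows where gender is 'M' or 'F'
data[(data['gender']=='M') | (data['gender']=='F')]    # same filter with | (or); each condition needs parentheses
data[(data['target_d']>100) & (data['target_d']<200)]  # & (and): 100 < target_d < 200; only this last result is displayed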

Unnamed: 0,id,state,gender,median_home_val,median_household_income,ic2,ic3,ic4,ic5,avggift,domain,dob,target_d
220,191779,FL,M,1432,636,693.0,680,772.0,35544,25.0,0,2701,150.0
846,12573,WA,F,1375,285,510.0,330,443.0,18247,63.035714,0,803,102.0
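
Each filter above is built from a boolean mask: a Series of `True`/`False` values with one entry per row. The sketch below shows the mask explicitly on a toy frame (the values are made up for illustration).

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "Female", "M"],
    "target_d": [100.0, 150.0, 20.0, 5.0],
})

# The comparison produces one boolean per row
mask = df["gender"] == "M"
men = df[mask]  # keeps only the rows where the mask is True

# isin checks membership in a list of allowed values
m_or_f = df[df["gender"].isin(["M", "F"])]

# Combine conditions with & (and) or | (or); wrap each one in parentheses
between = df[(df["target_d"] > 100) & (df["target_d"] < 200)]

print(mask.tolist())
print(list(men.index), list(m_or_f.index), list(between.index))
```

Note that the strict comparison `> 100` excludes the row with exactly 100, and that `'Female'` is not matched by `'F'`, which is the kind of inconsistency the cleaning steps deal with.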


#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 15 min (+ 5 min Review)

# 1.03 Activity 2

(_Keep using the same notebook you used in Activity 1._)

1. Merge (concatenate) the 'data' dataframe you created in class with the dataframes for 'file3' and 'file4' you created in the last activity. If you are not able to merge the dataframes, please take a look at the order of columns and/or the shape of the dataframes (specifically the number of columns in the dataframes). Also check the shape of the new dataframe.
2. Drop the columns `domain` and `dob`
3. Rearrange the columns by placing columns `ic2`, `ic3`, `ic4`, `ic5` before the column `median_home_val`
4. Filter the rows for men who live in Florida. Do not store the results in data.
5. Filter the rows for female donors who have donated less than 100. Do not store the results in data.

In [31]:
#1
data = pd.concat([data, file3, file4], axis=0)
data.shape

(4028, 20)

In [32]:
# 2
data = data.drop(['domain', 'dob'], axis =1) # Explain the argument axis, when axis is 0 and 1

In [33]:
# 3
data = data[['id', 'state', 'gender', 'ic2', 'ic3', 'ic4', 'ic5', 'median_home_val', 'median_household_income', 'avggift', 'target_d']]

In [34]:
# 4
data[(data['gender']=='M') & (data['state']=='FL')]

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
0,44060,FL,M,430.0,466,520.0,21975,AAA896,392,28.000000,100.0
34,11011,FL,M,270.0,257,269.0,10511,559,232,12.611111,20.0
101,37472,FL,M,239.0,290,301.0,9747,825,233,20.000000,20.0
110,185030,FL,M,274.0,288,326.0,15008,824,248,10.153846,10.0
137,13239,FL,M,270.0,233,332.0,16019,3083,166,2.916667,3.0
...,...,...,...,...,...,...,...,...,...,...,...
962,,FL,M,360.0,358,379.0,12869,,,9.500000,20.0
973,,FL,M,406.0,200,423.0,11670,,,9.230769,10.0
992,,FL,M,272.0,324,313.0,12509,,,10.115385,10.0
997,,FL,M,263.0,262,274.0,11132,,,14.692308,20.0


In [35]:
# 5
data[(data['target_d']<100) & (data['gender']=='F')]

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
2,43333,FL,F,340.0,361,436.0,18837,725.00,301,4.111111,5.0
4,190108,FL,F,280.0,316,348.0,17991,995.00,252,6.000000,5.0
8,173223,CA,F,206.0,235,250.0,8708,,184,8.818182,10.0
9,157988,CA,F,,619,617.0,17838,AAA2294,593,6.666667,10.0
10,141720,NV,F,672.0,785,781.0,25775,1569,673,13.000000,5.0
...,...,...,...,...,...,...,...,...,...,...,...
991,,WI,F,354.0,335,376.0,11623,,,7.666667,7.0
993,,TX,F,750.0,626,837.0,31480,,,6.916667,10.0
996,,MI,F,336.0,339,388.0,12653,,,8.533333,5.0
998,,CA,F,504.0,538,537.0,16165,,,12.117647,22.0


### Lesson 3 key concepts

> :clock10: 20 min

More data wrangling/cleaning with Python:

- Reset index
- Working with indexes

In [36]:
#filter and reset the index

# In this section, again emphasize the importance of playing with the code and checking the output
filtered = data[data['gender']=='M']  # Let's say that we are working on this filtered data

In [39]:
filtered

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
0,44060,FL,M,430.0,466,520.0,21975,AAA896,392,28.000000,100.0
1,96093,IL,M,415.0,410,473.0,19387,537.00,365,5.666667,7.0
2,21885,NC,M,407.0,399,413.0,14014,AAA1095,401,27.277778,38.0
3,100640,IL,M,477.0,480,501.0,16022,764.00,457,25.571429,30.0
4,119038,TX,M,525.0,551,560.0,17872,890.00,519,6.175000,7.0
...,...,...,...,...,...,...,...,...,...,...,...
1469,,CA,M,609.0,612,653.0,24745,,,12.333333,21.0
1470,,MI,M,264.0,269,299.0,10088,,,17.142857,25.0
1471,,FL,M,406.0,200,423.0,11670,,,9.230769,10.0
1472,,ND,M,298.0,274,299.0,10186,,,5.266667,5.0


In [38]:
# filtered
filtered = filtered.reset_index(drop=True)
# temp = filtered.copy()
# temp.set_index('state') # This is a dummy case, but indexes should be unique and not nulls, usually auto-increments by 1
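
To see why the index is reset after filtering, the following sketch uses a three-row toy frame (assumed values). Filtering keeps the original index labels, and `reset_index` either stores them in a new column or discards them.

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "M"], "ic2": [430.0, 340.0, 407.0]})

# Filtering keeps the original labels, so the index now has a gap
men = df[df["gender"] == "M"]

# Without drop=True the old index is kept as a new column named "index"
kept = men.reset_index()

# With drop=True the old index is discarded
dropped = men.reset_index(drop=True)

print(list(men.index))
print(list(kept.columns))
print(list(dropped.index), list(dropped.columns))
```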

In [42]:
# Working with indexes
filtered[1:4]

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
1,96093,IL,M,415.0,410,473.0,19387,537.00,365,5.666667,7.0
2,21885,NC,M,407.0,399,413.0,14014,AAA1095,401,27.277778,38.0
3,100640,IL,M,477.0,480,501.0,16022,764.00,457,25.571429,30.0


In [43]:
filtered[['gender', 'ic2', 'ic3']][0:10]

Unnamed: 0,gender,ic2,ic3
0,M,430.0,466
1,M,415.0,410
2,M,407.0,399
3,M,477.0,480
4,M,525.0,551
5,M,458.0,349
6,M,588.0,650
7,M,260.0,312
8,M,168.0,180
9,M,317.0,342


In [40]:
filtered.loc[1:3]

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
1,96093,IL,M,415.0,410,473.0,19387,537.00,365,5.666667,7.0
2,21885,NC,M,407.0,399,413.0,14014,AAA1095,401,27.277778,38.0
3,100640,IL,M,477.0,480,501.0,16022,764.00,457,25.571429,30.0


In [41]:
filtered.loc[100]   

id                         124379
state                          TX
gender                          M
ic2                           186
ic3                           204
ic4                           219
ic5                          6576
median_home_val               308
median_household_income       161
avggift                       4.5
target_d                        5
Name: 100, dtype: object

In [44]:
filtered.iloc[1:3]

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
1,96093,IL,M,415.0,410,473.0,19387,537.00,365,5.666667,7.0
2,21885,NC,M,407.0,399,413.0,14014,AAA1095,401,27.277778,38.0


In [45]:
# now, selecting by integer position only: iloc[rows, columns]
filtered.iloc[1:10,0:4]

Unnamed: 0,id,state,gender,ic2
1,96093,IL,M,415.0
2,21885,NC,M,407.0
3,100640,IL,M,477.0
4,119038,TX,M,525.0
5,87259,MT,M,458.0
6,115823,TX,M,588.0
7,95701,IL,M,260.0
8,5172,IL,M,168.0
9,152486,CA,M,317.0


In [46]:
filtered.iloc[[1,2,3,4],[0,2,4]]

Unnamed: 0,id,gender,ic3
1,96093,M,410
2,21885,M,399
3,100640,M,480
4,119038,M,551
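
The main difference between the selectors used above is how the end of a slice is treated. This is a minimal sketch on a five-row toy frame (made-up values): `loc` slices by label and includes the end label, while `iloc` and plain `[]` slicing work by position and exclude the end.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [10, 11, 12, 13, 14],
    "state": ["FL", "IL", "NC", "TX", "CA"],
})

by_label = df.loc[1:3]      # labels 1, 2 and 3 (end label included)
by_position = df.iloc[1:3]  # positions 1 and 2 (end position excluded)
plain_slice = df[1:4]       # plain slicing is positional too: positions 1, 2, 3

# iloc also accepts lists of row and column positions
picked = df.iloc[[0, 2], [1]]

# With a non-numeric index, loc slices by those labels instead
by_state = df.set_index("state").loc["IL":"TX"]

print(len(by_label), len(by_position), len(plain_slice))
print(picked["state"].tolist(), list(by_state.index))
```

On a freshly reset index, labels and positions coincide, which is why `filtered.loc[1:3]` and `filtered[1:4]` return the same rows in the lesson.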


#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 5 min Review)

# 1.03 Activity 3

(_Keep using the same notebook you used in Activity 1._)

1. Filter the results for women, and store the results in another DataFrame `filtered2`.
2. Check the first 10 rows of the DataFrame using the `head()` function.
3. Reset the index of `filtered2` with and without using the parameter `drop=True` and check the difference in the results.
4. Show the rows from index number 100 to 200.
5. Use `iloc` to get the first 100 rows and columns with indexes 2,3,4,5.

In [47]:
# 1
filtered2 = data[data['gender']=='F']

In [48]:
# 2
filtered2.head(10)

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
2,43333,FL,F,340.0,361,436.0,18837,725.00,301,4.111111,5.0
4,190108,FL,F,280.0,316,348.0,17991,995.00,252,6.0,5.0
8,173223,CA,F,206.0,235,250.0,8708,,184,8.818182,10.0
9,157988,CA,F,,619,617.0,17838,AAA2294,593,6.666667,10.0
10,141720,NV,F,672.0,785,781.0,25775,1569,673,13.0,5.0
11,186272,CA,F,565.0,549,588.0,20068,3515,521,8.64,10.0
12,154301,Cali,F,470.0,491,496.0,13803,1026,459,11.25,10.0
14,188304,KS,F,303.0,287,316.0,10755@,376,263,10.55,10.0
16,31977,FL,F,599.0,615,702.0,28124,1513,521,18.5,28.0
17,44336,FL,F,365.0,368,411.0,17728,948,330,15.0,15.0


In [49]:
# 3
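filtered2.reset_index()  # without drop=True, the old index is kept as a new column named 'index'
filtered2 = filtered2.reset_index(drop=True)  # with drop=True, the old index is discarded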

In [50]:
# 4
filtered2[100:201]

Unnamed: 0,id,state,gender,ic2,ic3,ic4,ic5,median_home_val,median_household_income,avggift,target_d
100,45101,FL,F,424.0,407,494.0,18738,1173,322,4.125000,3.0
101,36528,FL,F,373.0,376,400.0,9980,934,323,25.000000,25.0
102,133906,ID,F,261.0,262,294.0,10244,451,225,15.000000,10.0
103,12226,CO,F,273.0,272,302.0,10417,649,238,5.411765,8.0
104,8696,OK,F,439.0,433,456.0,13115,590,410,13.000000,11.0
...,...,...,...,...,...,...,...,...,...,...,...
196,122167,TX,F,262.0,303,294.0,9346,439,255,12.400000,10.0
197,106639,NE,F,422.0,428,477.0,13753,681,387,15.000000,15.0
198,152053,California,F,694.0,704,746.0,24844,3071,626,12.533333,25.0
199,142495,California,F,375.0,379,444.0,19170,2250,340,11.400000,21.0


In [51]:
# 5
filtered2.iloc[0:100,2:6]  # iloc excludes the end position, so 0:100 returns the first 100 rows

Unnamed: 0,gender,ic2,ic3,ic4
0,F,340.0,361,436.0
1,F,280.0,316,348.0
2,F,206.0,235,250.0
3,F,,619,617.0
4,F,672.0,785,781.0
...,...,...,...,...
96,F,610.0,659,669.0
97,F,701.0,838,805.0
98,F,321.0,300,353.0
99,F,363.0,341,388.0


### Lesson 4 key concepts

> :clock10: 20 min

Data cleaning operations with Python

- Correcting data types
- Removing duplicates

In [86]:
# data types
data.dtypes

id                          object
state                       object
gender                      object
ic2                        float64
ic3                         object
ic4                        float64
ic5                         object
median_home_val             object
median_household_income     object
avggift                    float64
target_d                   float64
dtype: object

In [52]:
data._get_numeric_data()  # private pandas helper; a public alternative is data.select_dtypes(include='number')

Unnamed: 0,ic2,ic4,avggift,target_d
0,430.0,520.0,28.000000,100.0
1,415.0,473.0,5.666667,7.0
2,340.0,436.0,4.111111,5.0
3,407.0,413.0,27.277778,38.0
4,280.0,348.0,6.000000,5.0
...,...,...,...,...
1001,424.0,470.0,14.285714,50.0
1002,406.0,423.0,9.230769,10.0
1003,298.0,299.0,5.266667,5.0
1004,386.0,397.0,11.400000,14.0


In [88]:
data._get_bool_data()  # private pandas helper; there are no boolean columns here, so only the index is shown

0
1
2
3
4
...
1001
1002
1003
1004
1005


In [55]:
data.select_dtypes('object')

Unnamed: 0,id,state,gender,ic3,ic5,median_home_val,median_household_income
0,44060,FL,M,466,21975,AAA896,392
1,96093,IL,M,410,19387,537.00,365
2,43333,FL,F,361,18837,725.00,301
3,21885,NC,M,399,14014,AAA1095,401
4,190108,FL,F,316,17991,995.00,252
...,...,...,...,...,...,...,...
1001,,FL,F,450,15356,,
1002,,FL,M,200,11670,,
1003,,ND,M,274,10186,,
1004,,WI,male,295,12315,,
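
As a short sketch of `select_dtypes` beyond the single call above, the example below uses a toy frame (assumed values) and the `include` and `exclude` arguments to split columns into numeric and non-numeric groups.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2],
    "state": ["FL", "IL"],
    "ic2": [430.0, 415.0],
})

# include="number" matches every integer and float dtype
numeric = df.select_dtypes(include="number")

# exclude="number" returns everything else (here, the text column)
non_numeric = df.select_dtypes(exclude="number")

print(list(numeric.columns))
print(list(non_numeric.columns))
```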


In [56]:
data.index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
             996,  997,  998,  999, 1000, 1001, 1002, 1003, 1004, 1005],
           dtype='int64', length=4028)

In [57]:
# correcting data types
data['median_home_val'] =  pd.to_numeric(data['median_home_val'], errors='coerce')

In [58]:
data['ic5'] =  pd.to_numeric(data['ic5'], errors='coerce')

In [None]:
# data._get_numeric_data() # to check if 'median_home_val' and 'ic5' are now listed as numeric data
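
What `errors='coerce'` does can be sketched on a small Series. The values below mirror the kinds of entries visible in the outputs of this lesson (`AAA896`, `10755@`); the Series itself is a toy example.

```python
import pandas as pd

s = pd.Series(["537.00", "AAA896", "725.00", "10755@"])

# Values that cannot be parsed as numbers become NaN instead of raising an error
converted = pd.to_numeric(s, errors="coerce")

print(converted)
print(converted.isna().sum())  # number of values that could not be converted
```

Because unparseable entries are replaced by `NaN` without any warning, it is worth counting the missing values before and after the conversion.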

In [59]:
# Removing duplicates
data = data.drop_duplicates()  # play around with the code, show them how to use keep argument
# temp = temp.drop_duplicates(subset=['state','gender', 'ic2', 'ic3'])
# if we want to remove duplicates based on some specific columns
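
The `keep` and `subset` arguments mentioned in the comments can be explored on a toy frame. This is a minimal sketch with made-up rows, where the first two rows are exact duplicates.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "state": ["CA", "CA", "CA", "TX"],
    "gender": ["F", "F", "F", "M"],
})

n_duplicates = df.duplicated().sum()           # rows that repeat an earlier row

keep_first = df.drop_duplicates()              # default keep="first"
keep_last = df.drop_duplicates(keep="last")    # keep the last occurrence instead
keep_none = df.drop_duplicates(keep=False)     # drop every row that has a duplicate

# Only compare the listed columns when deciding what counts as a duplicate
by_subset = df.drop_duplicates(subset=["state", "gender"])

print(n_duplicates)
print(list(keep_first.index), list(keep_last.index))
print(list(keep_none.index), list(by_subset.index))
```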

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 5 min Review)

# 1.03 Activity 4

(_Keep using the same notebook you used in Activity 1._)

1. Check the data types of all the columns in the DataFrame.
2. Use `select_dtypes()` to select all the numerical columns in the DataFrame (both integers and floats).
3. Convert the columns that have numerical values (which are now object types) to the numeric type.
4. Remove duplicates from the DataFrame if any.

In [60]:
# 1
data.dtypes

id                          object
state                       object
gender                      object
ic2                        float64
ic3                         object
ic4                        float64
ic5                        float64
median_home_val            float64
median_household_income     object
avggift                    float64
target_d                   float64
dtype: object

In [61]:
# 2
data.select_dtypes(include='number')  # 'number' matches every integer and float dtype; 'int32' alone would miss int64 columns

Unnamed: 0,ic2,ic4,ic5,median_home_val,avggift,target_d
0,430.0,520.0,21975.0,,28.000000,100.0
1,415.0,473.0,19387.0,537.0,5.666667,7.0
2,340.0,436.0,18837.0,725.0,4.111111,5.0
3,407.0,413.0,14014.0,,27.277778,38.0
4,280.0,348.0,17991.0,995.0,6.000000,5.0
...,...,...,...,...,...,...
996,336.0,388.0,12653.0,,8.533333,5.0
997,263.0,274.0,11132.0,,14.692308,20.0
998,504.0,537.0,16165.0,,12.117647,22.0
999,609.0,653.0,24745.0,,12.333333,21.0


In [65]:
# 3
data['median_household_income'] =  pd.to_numeric(data['median_household_income'], errors='coerce')

In [66]:
data['ic3'] =  pd.to_numeric(data['ic3'], errors='coerce')

In [67]:
# 4
data = data.drop_duplicates()

### :pencil2: Practice on key concepts - Lab

> :clock10: 45 min

# Lab | Customer Analysis Round 1

#### Remember the process:

1. Case Study
2. Get data
3. Cleaning/Wrangling/EDA
4. Processing Data
5. Modeling
6. Validation
7. Reporting

### Abstract

The objective of this analysis is to understand customer demographics and buying behavior. Later during the week, we will use predictive analytics to analyze the most profitable customers and how they interact. After that, we will take targeted actions to increase profitable customer response, retention, and growth.

For this lab, we will gather the data from 3 _csv_ files that are provided in the `files_for_lab` folder. Use that data and complete the data cleaning tasks as mentioned later in the instructions.

### Instructions

- Read the three files into Python as dataframes.
- Show the DataFrame's shape.
- Standardize header names.
- Rearrange the columns in the dataframe as needed
- Concatenate the three dataframes
- Which columns are numerical?
- Which columns are categorical?
- Understand the meaning of all columns
- Perform the data cleaning operations mentioned so far in class

  - Delete the column education and the number of open complaints from the dataframe.
  - Correct the values in the column customer lifetime value. They are given as a percent, so multiply them by 100 and change the `dtype` to a numeric type.
  - Check for duplicate rows in the data and remove if any.
  - Filter out the data for customers who have an income of 0 or less.

### Additional Resources

- [String functions - documentation](https://docs.python.org/3/library/stdtypes.html#string-methods)