# Lesson 1.4: Data Cleaning - Pandas

### Lesson Duration: 3 hours

> Purpose: This lesson introduces further data cleaning techniques: handling null values in numerical and categorical columns, using lambda expressions and map functions, and working with the DateTime format.

---

### Setup

To start this lesson, students should have:

- Completed lesson 1.3
- All previous Setup

### Learning Objectives

After this lesson, students will be able to:

- Use different methods for replacing null values
- Create user-defined functions for cleaning numerical and categorical columns
- Create anonymous functions/lambda expressions and use them with map functions
- Work with DateTime format and string functions

---

### Lesson 1 key concepts

> :clock10: 20 min

Handling null values in the dataframe

- Removing columns with many null values (threshold based)
- Replacing/imputing null values (numerical)

> :exclamation: We will keep working on the same data sets from 1.03, so you can reuse the same notebook or run the code below to load the data into a new notebook.

In [9]:
import pandas as pd

file1 = pd.read_csv('./files_from_activity_1.03/file1.csv')
file2 = pd.read_csv('./files_from_activity_1.03/file2.txt', sep = '\t')
file3 = pd.read_excel('./files_from_activity_1.03/file3.xlsx', engine='openpyxl')
file4 = pd.read_excel('./files_from_activity_1.03/file4.xlsx', engine='openpyxl')
data = pd.concat([file1, file2, file3, file4], axis=0)
cols = []
for colname in data.columns:
    cols.append(colname.lower())
data.columns = cols
data = data.rename(columns={'controln':'id',
                            'hv1':'median_home_val',
                            'ic1':'median_household_income'})
data['median_home_val'] =  pd.to_numeric(data['median_home_val'], errors='coerce')
data['ic5'] =  pd.to_numeric(data['ic5'], errors='coerce')
data = data.drop_duplicates()

data.shape

(4001, 17)

In [10]:
data.isna().sum() 

id                           0
state                        0
gender                     133
median_home_val             10
median_household_income      0
ic4                          1
hvp1                         0
ic5                          6
pobc1                        0
pobc2                        0
ic2                          1
ic3                          0
avggift                      0
tcode                        0
dob                          0
domain                       0
target_d                     0
dtype: int64

In [11]:
round(data.isna().sum()/len(data),4)*100  # shows the percentage of null values in a column

id                         0.00
state                      0.00
gender                     3.32
median_home_val            0.25
median_household_income    0.00
ic4                        0.02
hvp1                       0.00
ic5                        0.15
pobc1                      0.00
pobc2                      0.00
ic2                        0.02
ic3                        0.00
avggift                    0.00
tcode                      0.00
dob                        0.00
domain                     0.00
target_d                   0.00
dtype: float64
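
The same percentages can also be obtained with `isna().mean()`, because the mean of a boolean column is the fraction of `True` values. A minimal sketch on a made-up dataframe (`toy` and its columns are hypothetical and not part of the lesson data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the calculation
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, 4.0],
    'b': ['x', 'y', None, None],
    'c': [10, 20, 30, 40],
})

# Approach used in the lesson: count of nulls divided by the number of rows
pct_by_count = toy.isna().sum() / len(toy) * 100

# Equivalent shortcut: the mean of a boolean mask is the share of True values
pct_by_mean = toy.isna().mean() * 100

print(pct_by_mean)
```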

In [13]:
nulls_df = pd.DataFrame(round(data.isna().sum()/len(data),4)*100)

In [14]:
nulls_df
nulls_df = nulls_df.reset_index()
nulls_df
nulls_df.columns = ['header_name', 'percent_nulls']
nulls_df

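|   | header_name             | percent_nulls |
|---|-------------------------|---------------|
| 0 | id                      | 0.0           |
| 1 | state                   | 0.0           |
| 2 | gender                  | 3.32          |
| 3 | median_home_val         | 0.25          |
| 4 | median_household_income | 0.0           |
| 5 | ic4                     | 0.02          |
| 6 | hvp1                    | 0.0           |
| 7 | ic5                     | 0.15          |
| 8 | pobc1                   | 0.0           |
| 9 | pobc2                   | 0.0           |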


- Since there are not many null values in this dataframe, we use a dummy threshold of 3% to show students how to remove columns whose percentage of null values exceeds a threshold.
- In practice the threshold is often around 60-70%, but it is very case-specific. There is no standard rule; it depends on the analyst's judgment for that analysis.
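
The threshold-based removal described above can be wrapped in a small helper. This is only a sketch: the function name `drop_sparse_columns`, its parameter `threshold_pct`, and the toy dataframe are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, threshold_pct):
    """Return a copy of df without columns whose null percentage exceeds threshold_pct."""
    null_pct = df.isna().mean() * 100
    to_drop = null_pct[null_pct > threshold_pct].index
    return df.drop(columns=to_drop)

# Hypothetical toy data
toy = pd.DataFrame({
    'mostly_empty': [np.nan, np.nan, np.nan, 1.0],  # 75% null
    'some_nulls': [1.0, np.nan, 3.0, 4.0],          # 25% null
    'complete': [1, 2, 3, 4],                       # 0% null
})

trimmed = drop_sparse_columns(toy, threshold_pct=60)
print(list(trimmed.columns))
```

Because `drop` returns a new dataframe, the original `toy` is left untouched, which makes it easy to try several thresholds.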

In [15]:
columns_drop = nulls_df[nulls_df['percent_nulls']>3]['header_name']  # dummy case with 3
print(columns_drop.values)
# data = data.drop(columns_drop, axis=1)  # drop a list of columns
# data = data.drop(['gender'], axis=1)  # drop a single column

['gender']


In [16]:
data[data['gender'].isna()==False]

Unnamed: 0,id,state,gender,median_home_val,median_household_income,ic4,hvp1,ic5,pobc1,pobc2,ic2,ic3,avggift,tcode,dob,domain,target_d
0,44060,FL,M,,392,520.0,7,21975.0,6,16,430.0,466,28.000000,1,1901,C2,100.0
1,96093,IL,M,537.0,365,473.0,0,19387.0,1,89,415.0,410,5.666667,0,0,T2,7.0
2,43333,FL,F,725.0,301,436.0,3,18837.0,11,17,340.0,361,4.111111,0,2501,C2,5.0
3,21885,NC,M,,401,413.0,7,14014.0,1,74,407.0,399,27.277778,0,2208,T2,38.0
4,190108,FL,F,995.0,252,348.0,0,17991.0,5,6,280.0,316,6.000000,28,0,C2,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,66762,MI,F,632.0,279,388.0,2,12653.0,1,71,336.0,339,8.533333,0,0,0,5.0
997,6443,FL,M,595.0,252,274.0,0,11132.0,8,11,263.0,262,14.692308,1,2501,T2,20.0
998,151175,CA,F,2707.0,507,537.0,80,16165.0,24,54,504.0,538,12.117647,0,4001,U1,22.0
999,151504,CA,M,2666.0,535,653.0,63,24745.0,22,45,609.0,612,12.333333,1,4401,S1,21.0


In [17]:
# Replacing/imputing null values
data[data['gender'].isna()==True] # checking rows that are null based on a specific column
data = data[data['ic2'].isna()==False] # There are very few nulls in 'ic2', so we can simply filter those rows out
data

Unnamed: 0,id,state,gender,median_home_val,median_household_income,ic4,hvp1,ic5,pobc1,pobc2,ic2,ic3,avggift,tcode,dob,domain,target_d
0,44060,FL,M,,392,520.0,7,21975.0,6,16,430.0,466,28.000000,1,1901,C2,100.0
1,96093,IL,M,537.0,365,473.0,0,19387.0,1,89,415.0,410,5.666667,0,0,T2,7.0
2,43333,FL,F,725.0,301,436.0,3,18837.0,11,17,340.0,361,4.111111,0,2501,C2,5.0
3,21885,NC,M,,401,413.0,7,14014.0,1,74,407.0,399,27.277778,0,2208,T2,38.0
4,190108,FL,F,995.0,252,348.0,0,17991.0,5,6,280.0,316,6.000000,28,0,C2,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,66762,MI,F,632.0,279,388.0,2,12653.0,1,71,336.0,339,8.533333,0,0,0,5.0
997,6443,FL,M,595.0,252,274.0,0,11132.0,8,11,263.0,262,14.692308,1,2501,T2,20.0
998,151175,CA,F,2707.0,507,537.0,80,16165.0,24,54,504.0,538,12.117647,0,4001,U1,22.0
999,151504,CA,M,2666.0,535,653.0,63,24745.0,22,45,609.0,612,12.333333,1,4401,S1,21.0


In [18]:
# import numpy
import numpy as np
mean_median_home_value = np.mean(data['median_home_val'])
data['median_home_val'] = data['median_home_val'].fillna(mean_median_home_value)

In [19]:
data

Unnamed: 0,id,state,gender,median_home_val,median_household_income,ic4,hvp1,ic5,pobc1,pobc2,ic2,ic3,avggift,tcode,dob,domain,target_d
0,44060,FL,M,1157.329241,392,520.0,7,21975.0,6,16,430.0,466,28.000000,1,1901,C2,100.0
1,96093,IL,M,537.000000,365,473.0,0,19387.0,1,89,415.0,410,5.666667,0,0,T2,7.0
2,43333,FL,F,725.000000,301,436.0,3,18837.0,11,17,340.0,361,4.111111,0,2501,C2,5.0
3,21885,NC,M,1157.329241,401,413.0,7,14014.0,1,74,407.0,399,27.277778,0,2208,T2,38.0
4,190108,FL,F,995.000000,252,348.0,0,17991.0,5,6,280.0,316,6.000000,28,0,C2,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
996,66762,MI,F,632.000000,279,388.0,2,12653.0,1,71,336.0,339,8.533333,0,0,0,5.0
997,6443,FL,M,595.000000,252,274.0,0,11132.0,8,11,263.0,262,14.692308,1,2501,T2,20.0
998,151175,CA,F,2707.000000,507,537.0,80,16165.0,24,54,504.0,538,12.117647,0,4001,U1,22.0
999,151504,CA,M,2666.000000,535,653.0,63,24745.0,22,45,609.0,612,12.333333,1,4401,S1,21.0


#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

# 1.04 Activity 1

Refer to the file `files_for_lesson_and_activities/file1.csv` for this exercise.

1. Load data (`file1.csv`) in a new Jupyter notebook.
2. Write the code to remove the null values from the columns 'IC4' and 'IC5' of the dataframe.
3. Use `head()` to check the new dataframe.

### Debate

1. Is it better to fill null values with the mean or with the median?
2. Can we drop all rows with null values?

In [20]:
df = pd.read_csv('./files_for_lesson_and_activities/file1.csv')

In [21]:
df = df[df['IC4'].isna()==False]

In [22]:
df = df[df['IC5'].isna()==False]

In [23]:
df.head()

Unnamed: 0,CONTROLN,STATE,GENDER,HV1,IC1,IC4,HVP1,IC5,POBC1,POBC2,IC2,IC3,AVGGIFT,TCODE,DOB,DOMAIN,TARGET_D
0,44060,FL,M,AAA896,392,520.0,7,21975,6,16,430.0,466,28.0,1,1901,C2,100.0
1,96093,IL,M,537.00,365,473.0,0,19387,1,89,415.0,410,5.666667,0,0,T2,7.0
2,43333,FL,F,725.00,301,436.0,3,18837,11,17,340.0,361,4.111111,0,2501,C2,5.0
3,21885,NC,M,AAA1095,401,413.0,7,14014,1,74,407.0,399,27.277778,0,2208,T2,38.0
4,190108,FL,F,995.00,252,348.0,0,17991,5,6,280.0,316,6.0,28,0,C2,5.0


### Class debate

1. As a general rule, if the column has a lot of outliers, the median is preferable to the mean; otherwise the mean is a reasonable choice. One advantage of these imputation techniques is that they preserve the statistic used for filling: imputing with the mean leaves the column mean unchanged, and imputing with the median leaves the column median unchanged. These are not the only options. There are many other methods, which we will look at in later sessions. Sometimes the missing values in a numerical column are simply replaced by a constant, usually 0. It is very case-dependent, and there are no hard rules.

2. Generally speaking, it is not a good idea, because you lose all the information in the other columns of those rows. Be careful when you filter out rows with null values: check the percentage of data you would lose and decide whether that loss is acceptable.
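
To make the debate concrete, the sketch below compares mean and median imputation on a made-up series with one extreme outlier, and computes the share of rows that dropping nulls would cost. The values are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one extreme outlier (5000) and one missing entry
values = pd.Series([10.0, 12.0, 11.0, 13.0, 5000.0, np.nan])

mean_value = values.mean()      # pulled upwards by the outlier
median_value = values.median()  # robust to the outlier

filled_with_mean = values.fillna(mean_value)
filled_with_median = values.fillna(median_value)

# Share of rows that would be lost by dropping nulls instead of imputing
share_lost = values.isna().mean() * 100

print(mean_value, median_value, round(share_lost, 2))
```

Here the mean is far from every typical value, so filling the gap with it would introduce an unrealistic entry, while the median stays close to the bulk of the data.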

### Lesson 2 key concepts

> :clock10: 20 min

- Handling null values in the dataframe

  - Replacing/imputing null values (categorical)

- Lambda expressions

In [24]:
# Replacing null values for categorical variables
data['gender'].value_counts()

F          1954
M          1466
male        126
female      106
Female       75
U            68
Male         33
J            23
feamale      15
A             1
Name: gender, dtype: int64

In [25]:
len(data[data['gender'].isna()==True])  # number of missing values

133

In [26]:
data['gender'] = data['gender'].fillna('F')  # 'F' is the most frequent value in the column

In [27]:
len(data[data['gender'].isna()==True]) # now this number is 0

0

In [28]:
# Exporting this processed data to a csv
data.to_csv('merged_clean_ver1.csv') # you can find this file inside files_for_lesson_and_activities folder

> :exclamation: It is important to emphasize documenting your work and saving copies of the data. For now, documentation can be done simply by adding more comments to the code. Saving copies of the data is also important so that students don't have to re-run the code from the beginning every time, since we are using many operations from multiple lessons.

In [29]:
# lambda expressions
y = lambda x: x+2
print(y(2))

4


In [30]:
square = lambda x: x*x
square(4)

16

In [31]:
addition = lambda x,y : x+y
addition(1,3)

4

In [32]:
lst = [1,2,3,4,5,6,7,8,10]
new_list = []
for item in lst:
    new_list.append(square(item))
new_list

[1, 4, 9, 16, 25, 36, 49, 64, 100]

In [33]:
new_list = [square(item) for item in lst] # list comprehension
new_list

[1, 4, 9, 16, 25, 36, 49, 64, 100]

In [34]:
new_list = [square(item) for item in lst if item%2==0] # only squares the even numbers in list
new_list

[4, 16, 36, 64, 100]
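
The same results can also be produced with the built-in `map` and `filter` functions, which take a function (here a lambda) and an iterable. A short sketch reusing `square` and `lst` from the cells above:

```python
square = lambda x: x * x
lst = [1, 2, 3, 4, 5, 6, 7, 8, 10]

# map applies the function to every item; list() materializes the result
squares = list(map(square, lst))

# filter keeps only the items for which the predicate returns True
even_squares = list(map(square, filter(lambda x: x % 2 == 0, lst)))

print(squares)
print(even_squares)
```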

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

# 1.04 Activity 2

Refer to the file `files_for_lesson_and_activities/merged_clean_ver1.csv` for this exercise.

1. Import the data from `merged_clean_ver1.csv` as a dataframe. There will be a column with a sequence of numbers (to the left of column _id_). Drop that column.

2. Check the column _state_ for null values. Replace those null values with the state that appears the largest number of times in that column.

3. Lambda Expression:

   - Create a simple lambda expression to add three numbers. Take those three numbers as input from the user. (_Since you will accept only numbers as valid inputs, check this [example](https://www.tutorialspoint.com/How-can-we-read-inputs-as-integers-in-Python) to see how to do it._)
   - Define a list as `lst = [1,2,3,4,5,6,7,8,10]`. Write a lambda expression to find the cube of a number. Use that lambda expression to find the cube of every number in the list. Define a _list comprehension_ for this question.

In [35]:
# 1
import warnings  # used here to silence warnings (e.g. deprecation notices)
warnings.filterwarnings('ignore')

In [36]:
df = pd.read_csv('merged_clean_ver1.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,id,state,gender,median_home_val,median_household_income,ic4,hvp1,ic5,pobc1,pobc2,ic2,ic3,avggift,tcode,dob,domain,target_d
0,0,44060,FL,M,1157.329241,392,520.0,7,21975.0,6,16,430.0,466,28.0,1,1901,C2,100.0
1,1,96093,IL,M,537.0,365,473.0,0,19387.0,1,89,415.0,410,5.666667,0,0,T2,7.0
2,2,43333,FL,F,725.0,301,436.0,3,18837.0,11,17,340.0,361,4.111111,0,2501,C2,5.0
3,3,21885,NC,M,1157.329241,401,413.0,7,14014.0,1,74,407.0,399,27.277778,0,2208,T2,38.0
4,4,190108,FL,F,995.0,252,348.0,0,17991.0,5,6,280.0,316,6.0,28,0,C2,5.0


In [37]:
df = df.drop(['Unnamed: 0'], axis=1)
df.head()

Unnamed: 0,id,state,gender,median_home_val,median_household_income,ic4,hvp1,ic5,pobc1,pobc2,ic2,ic3,avggift,tcode,dob,domain,target_d
0,44060,FL,M,1157.329241,392,520.0,7,21975.0,6,16,430.0,466,28.0,1,1901,C2,100.0
1,96093,IL,M,537.0,365,473.0,0,19387.0,1,89,415.0,410,5.666667,0,0,T2,7.0
2,43333,FL,F,725.0,301,436.0,3,18837.0,11,17,340.0,361,4.111111,0,2501,C2,5.0
3,21885,NC,M,1157.329241,401,413.0,7,14014.0,1,74,407.0,399,27.277778,0,2208,T2,38.0
4,190108,FL,F,995.0,252,348.0,0,17991.0,5,6,280.0,316,6.0,28,0,C2,5.0
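
The extra `Unnamed: 0` column appears because `to_csv` writes the dataframe index by default. As a sketch of two ways to avoid it, the example below round-trips a small made-up dataframe through an in-memory buffer (`io.StringIO` stands in for a file on disk; `toy` is hypothetical):

```python
import io
import pandas as pd

toy = pd.DataFrame({'id': [44060, 96093], 'state': ['FL', 'IL']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded = pd.read_csv(with_index)
print(reloaded.columns.tolist())

# Option 1: do not write the index at all
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
clean = pd.read_csv(without_index)

# Option 2: tell read_csv that the first column is the index
with_index.seek(0)
restored = pd.read_csv(with_index, index_col=0)
print(clean.columns.tolist(), restored.columns.tolist())
```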


In [38]:
# 2
df['state'].value_counts()

CA            751
FL            338
TX            293
IL            248
MI            225
NC            168
WA            153
GA            122
OR            117
WI            110
MO            109
IN            108
California    100
CO             90
AZ             88
SC             85
MN             75
KY             67
AL             59
OK             57
LA             56
TN             52
KS             50
NV             46
NM             44
IA             44
AR             42
MS             33
NE             33
Tennessee      29
Cali           24
MT             24
ID             23
SD             22
HI             21
UT             21
Arizona        17
ND             16
WY             12
AK              6
MD              4
AP              3
CT              3
PA              2
AA              2
NJ              2
NY              1
GU              1
AE              1
VT              1
VA              1
WV              1
Name: state, dtype: int64

In [39]:
len(df[df['state'].isna()==True])  # number of missing values

0

In [40]:
df['state'] = df['state'].fillna('CA')
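
Instead of reading the most frequent value off `value_counts()` and hard-coding it, the value can be computed with `mode()`. A minimal sketch on a made-up series (the data is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
state = pd.Series(['CA', 'FL', np.nan, 'CA', 'TX', np.nan])

# mode() ignores NaN by default and returns a Series (there can be ties),
# so take the first entry
most_frequent = state.mode()[0]

state_filled = state.fillna(most_frequent)
print(most_frequent, state_filled.isna().sum())
```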

In [41]:
# 3.1
x = int(input('enter X'))
y = int(input('enter Y'))
z = int(input('enter Z'))

In [42]:
addition = lambda x,y,z : x+y+z
addition(x,y,z)

In [43]:
# 3.2
cube = lambda x: x*x*x

In [44]:
lst = [1,2,3,4,5,6,7,8,10]
new_list = []
for item in lst:
    new_list.append(cube(item))
new_list

[1, 8, 27, 64, 125, 216, 343, 512, 1000]

In [45]:
new_list = [cube(item) for item in lst]
new_list

[1, 8, 27, 64, 125, 216, 343, 512, 1000]

### Lesson 3 key concepts

> :clock10: 20 min

- Map functions

  - Map functions and lambda expressions for data cleaning

- Define a custom function to clean categorical columns

In [46]:
# map functions
data.columns = list(map(lambda x: x.lower(), data.columns))

In [47]:
data['gender'].unique() # check the unique values in the column

array(['M', 'F', 'female', 'Male', 'U', 'J', 'male', 'Female', 'feamale',
       'A'], dtype=object)

In [48]:
data['gender'] = list(map(lambda x: x.upper(), data['gender']))

In [49]:
data['gender'].unique()  # check the unique elements in the column

array(['M', 'F', 'FEMALE', 'MALE', 'U', 'J', 'FEAMALE', 'A'], dtype=object)

In [50]:
def greetings(name):
    print("Hola " + name)

In [51]:
def suma(a: int, b: int):
    return a + b

In [52]:
suma(2, 3)

5

In [53]:
greetings("Josep")

Hola Josep


In [59]:
# Now define a function to clean the column
def clean(x):
    if x in ['M', 'MALE']:
        return 'Male'
    elif x.startswith('F'):
        return 'Female'
    else:
        return 'U'

In [55]:
clean('Fenix')

'Female'

In [56]:
data['gender'] = list(map(clean, data['gender']))

In [57]:
data['gender']

0         Male
1         Male
2       Female
3         Male
4       Female
         ...  
996     Female
997       Male
998     Female
999       Male
1000      Male
Name: gender, Length: 4000, dtype: object

In [58]:
data['gender'].unique()  # To check the results again

# define another method to clean the column "state" in the dataframe. This can be an activity

array(['Male', 'Female', 'U'], dtype=object)
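
The `clean` function above assumes every value is a string; a missing value (`NaN`) would raise an `AttributeError` at `x.startswith`. In the lesson the nulls were filled beforehand, so this does not occur. As a sketch of a more defensive variant, the hypothetical `clean_gender` below handles missing values and is applied with `Series.map`:

```python
import numpy as np
import pandas as pd

def clean_gender(x):
    """Standardize a raw gender label; missing or unknown values become 'U'."""
    if pd.isna(x):
        return 'U'
    x = str(x).strip().upper()
    if x in ['M', 'MALE']:
        return 'Male'
    elif x.startswith('F'):
        return 'Female'
    else:
        return 'U'

# Hypothetical raw values, including a missing one
raw = pd.Series(['M', 'female', 'Male', 'feamale', 'J', np.nan])

# Series.map applies the function element by element, like list(map(...))
cleaned = raw.map(clean_gender)
print(cleaned.tolist())
```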

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

# 1.04 Activity 3

Try to convert these functions into lambda functions.

```python
def function_1(a,b):
    c=a+b
    return c
```

```python
def function_2(a,b):
    c=0
    for i in range(a):
        c+=2*b
    return c
```

```python
def function_3(a,b):
    c=0
    if a>3:
        c=12
    else:
        c='Too big.'
    return c
```

In [42]:
function_1=lambda a,b : a+b

In [43]:
function_2=lambda a,b : sum([2*b for i in range(a)])

In [44]:
function_3=lambda a,b : 12 if a>3 else 'Too big.'
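
A quick way to check a conversion is to compare the lambda against the original function on a few inputs. The sketch below does this for `function_2`; the names `function_2_lambda` and `function_2_closed` are introduced only for this comparison.

```python
def function_2(a, b):
    c = 0
    for i in range(a):
        c += 2 * b
    return c

function_2_lambda = lambda a, b: sum([2 * b for i in range(a)])

# The loop adds 2*b exactly a times, so for non-negative integer a
# the result equals 2*a*b (for negative a the loop never runs and returns 0)
function_2_closed = lambda a, b: 2 * a * b

for a in range(5):
    for b in range(-2, 3):
        assert function_2(a, b) == function_2_lambda(a, b) == function_2_closed(a, b)

print(function_2_lambda(3, 4))
```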

### Lesson 4 key concepts

> :clock10: 20 min

More data wrangling/cleaning using Python:

- Working with DateTime format
- Using string functions

In [60]:
# Examples of working with datetime format:

file = pd.read_csv('./files_for_lesson_and_activities/df_final_web_data_pt_1.csv')
file.dtypes

Unnamed: 0       int64
client_id        int64
visitor_id      object
visit_id        object
process_step    object
date_time       object
dtype: object

In [61]:
file.head()

|   | Unnamed: 0 | client_id | visitor_id | visit_id | process_step | date_time |
|---|------------|-----------|------------|----------|--------------|-----------|
| 0 | 0 | 9988021 | 580560515_7732621733 | 781255054_21935453173_531117 | step_3 | 4/17/17 15:27 |
| 1 | 1 | 9988021 | 580560515_7732621733 | 781255054_21935453173_531117 | step_2 | 4/17/17 15:26 |
| 2 | 2 | 9988021 | 580560515_7732621733 | 781255054_21935453173_531117 | step_3 | 4/17/17 15:19 |
| 3 | 3 | 9988021 | 580560515_7732621733 | 781255054_21935453173_531117 | step_2 | 4/17/17 15:19 |
| 4 | 4 | 9988021 | 580560515_7732621733 | 781255054_21935453173_531117 | step_3 | 4/17/17 15:18 |


In [62]:
file['date_time'] = pd.to_datetime(file['date_time'], errors='coerce')

In [63]:
file['date_time'][0].day

17

In [64]:
file['date_time'][0].month

4

In [65]:
file['date_time'][0].year

2017

In [66]:
file['date_time'][0].isoweekday()  # Returns 1 for Monday through 7 for Sunday

1

In [67]:
file['date_time'][0].time()

datetime.time(15, 27)

In [69]:
file['date_time'][0].isoformat()

'2017-04-17T15:27:00'

In [70]:
file['date_time'][0].strftime(format='%d-%m-%Y')

'17-04-2017'

In [71]:
file['date_time'][0].strftime(format="%A %d. %B %Y")

'Monday 17. April 2017'
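
The examples above inspect a single timestamp. To extract the same attributes for an entire column, pandas provides the `.dt` accessor. The sketch below uses a small made-up dataframe in the same date layout (`sample` is hypothetical) and passes an explicit `format`, which is an assumption about how the strings are laid out:

```python
import pandas as pd

# Hypothetical sample in the same 'month/day/2-digit year hour:minute' layout
sample = pd.DataFrame({'date_time': ['4/17/17 15:27', '4/18/17 09:05', 'not a date']})

# An explicit format avoids guessing; errors='coerce' turns unparseable text into NaT
sample['date_time'] = pd.to_datetime(
    sample['date_time'], format='%m/%d/%y %H:%M', errors='coerce'
)

# The .dt accessor exposes the datetime attributes for the whole column
sample['month'] = sample['date_time'].dt.month
sample['weekday'] = sample['date_time'].dt.day_name()
print(sample)
```

Rows that could not be parsed end up as `NaT`, so checking `isna()` on the converted column is a simple way to spot bad date strings.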

In [72]:
import time
from datetime import date

In [73]:
today = date.today()
today

datetime.date(2022, 9, 28)

In [74]:
time.localtime(time.time())

time.struct_time(tm_year=2022, tm_mon=9, tm_mday=28, tm_hour=16, tm_min=49, tm_sec=58, tm_wday=2, tm_yday=271, tm_isdst=1)

In [75]:
time.gmtime(time.time())

time.struct_time(tm_year=2022, tm_mon=9, tm_mday=28, tm_hour=14, tm_min=50, tm_sec=28, tm_wday=2, tm_yday=271, tm_isdst=0)

In [76]:
# Examples of working with string functions
string = " I am learning  data  analysis at Ironhack  . It is  super easy " 

In [77]:
string.lower()

' i am learning  data  analysis at ironhack  . it is  super easy '

In [78]:
string.upper()

' I AM LEARNING  DATA  ANALYSIS AT IRONHACK  . IT IS  SUPER EASY '

In [80]:
'3.4'.isdigit() # returns False: isdigit() does not work with decimal numbers

False

In [82]:
string.lstrip()

'I am learning  data  analysis at Ironhack  . It is  super easy '

In [83]:
string.rstrip()

' I am learning  data  analysis at Ironhack  . It is  super easy'

In [84]:
string.split()

['I',
 'am',
 'learning',
 'data',
 'analysis',
 'at',
 'Ironhack',
 '.',
 'It',
 'is',
 'super',
 'easy']

In [85]:
string.split('.')

[' I am learning  data  analysis at Ironhack  ', ' It is  super easy ']

In [86]:
string.replace('  ', '')

' I am learningdataanalysis at Ironhack. It issuper easy '
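
Note that replacing double spaces with an empty string glues neighbouring words together. If the goal is to collapse repeated spaces into single ones, one common approach is to combine `split()` and `join()`, sketched here on the same string:

```python
string = " I am learning  data  analysis at Ironhack  . It is  super easy "

# split() with no argument splits on any run of whitespace and drops empty pieces,
# so joining with a single space collapses repeated spaces and trims both ends
normalized = ' '.join(string.split())
print(normalized)
```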

#### :pencil2: Check for Understanding - Class activity/quick quiz

> :clock10: 10 min (+ 10 min Review)

# 1.04 Activity 4

Use the same notebook in which you already loaded `merged_clean_ver1.csv`.

- Create a user-defined function to clean the column `state` in the dataframe.
- Use string functions to standardize the states to uppercase, and use the strip function to clean the strings as well.

In [69]:
data['state'].unique()

array(['FL', 'IL', 'NC', 'TX', 'CA', 'NV', 'Cali', 'AP', 'KS', 'MI', 'OK',
       'AR', 'IN', 'MT', 'WI', 'MO', 'HI', 'UT', 'GA', 'WA', 'ID', 'CT',
       'AL', 'ND', 'SC', 'IA', 'CO', 'LA', 'OR', 'SD', 'TN', 'NM', 'AZ',
       'MN', 'KY', 'NJ', 'NE', 'California', 'MS', 'NY', 'Arizona', 'WY',
       'Tennessee', 'MD', 'AK', 'VA', 'AE', 'AA', 'PA', 'VT', 'WV', 'GU'],
      dtype=object)

In [88]:
def clean(x):
    x = x.upper()
    x = x.strip()
    if x in ['AZ', 'ARIZONA']:
        return 'AZ'
    elif x in ['CA', 'CALIFORNIA', 'CALI']:
        return 'CA'
    elif x in ['TN', 'TENNESSEE']:
        return 'TN'
    else:
        return x

In [89]:
data['state'] = list(map(clean,data['state']))

In [90]:
data['state'].unique()

array(['FL', 'IL', 'NC', 'TX', 'CA', 'NV', 'AP', 'KS', 'MI', 'OK', 'AR',
       'IN', 'MT', 'WI', 'MO', 'HI', 'UT', 'GA', 'WA', 'ID', 'CT', 'AL',
       'ND', 'SC', 'IA', 'CO', 'LA', 'OR', 'SD', 'TN', 'NM', 'AZ', 'MN',
       'KY', 'NJ', 'NE', 'MS', 'NY', 'WY', 'MD', 'AK', 'VA', 'AE', 'AA',
       'PA', 'VT', 'WV', 'GU'], dtype=object)
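
When the list of special cases grows, a lookup dictionary can be easier to maintain than a chain of `if`/`elif` branches. The sketch below is an alternative, not the activity's reference solution; the dictionary `STATE_ALIASES` and the sample series are hypothetical.

```python
import pandas as pd

# Hypothetical lookup table from long or informal names to two-letter codes
STATE_ALIASES = {'ARIZONA': 'AZ', 'CALIFORNIA': 'CA', 'CALI': 'CA', 'TENNESSEE': 'TN'}

raw = pd.Series([' ca', 'Cali', 'California', 'Arizona ', 'TN', 'Tennessee', 'FL'])

# Vectorized string methods standardize case and whitespace for the whole column;
# replace() then swaps only the values that appear as keys in the dictionary
cleaned = raw.str.strip().str.upper().replace(STATE_ALIASES)
print(cleaned.unique())
```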

### :pencil2: Practice on key concepts - Lab

> :clock10: 30 min

# Lab | Customer Analysis Round 2

For this lab, we will be using the `marketing_customer_analysis.csv` file that you can find in the `files_for_lab` folder. Check out the `files_for_lab/about.md` to get more information if you are using the Online Excel.

**Note**: For the next labs we will be using the same data file. Please save the code, so that you can re-use it later in the labs following this lab.

### Dealing with the data

1. Show the dataframe shape.
2. Standardize header names.
3. Which columns are numerical?
4. Which columns are categorical?
5. Check and deal with `NaN` values.
6. Datetime format - Extract the months from the dataset and store them in a separate column. Then filter the data to show only the information for the first quarter, i.e. January, February, and March. _Hint_: If data from March does not exist, consider only January and February.
7. BONUS: Put all the previously mentioned data transformations into a function.

### Additional Resources

- [Python DateTime](https://www.programiz.com/python-programming/datetime)
- [Python String Methods](https://www.programiz.com/python-programming/methods/string)