# Pandas II - Data Cleaning

_May 13, 2020_

Agenda today:
- Introduction to lambda functions
- Introduction to data cleaning in pandas

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Part I. Lambda function
Lambda functions are known as anonymous functions in Python. They let you write one-line functions, which are often used together with `map()` and `filter()`.

Syntax of a lambda function: `lambda arguments: expression`.
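
As a minimal sketch of that syntax, the snippet below puts a named function next to its lambda equivalent and then passes lambdas to `map()` and `filter()`. The names `add_ten`, `add_ten_lambda`, and `numbers` are made up for illustration.

```python
# A named function and the equivalent lambda
def add_ten(x):
    return x + 10

add_ten_lambda = lambda x: x + 10

numbers = [1, 2, 3, 4]

# map() applies the function to every element
mapped = list(map(add_ten_lambda, numbers))

# filter() keeps only the elements for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(mapped)  # [11, 12, 13, 14]
print(evens)   # [2, 4]
```

Both `map()` and `filter()` return lazy iterators in Python 3, which is why the results are wrapped in `list()`.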

In [10]:
# lambda function with one argument
func_1 = lambda x: x+10

In [12]:
# using the function
func_1(10)
#(lambda x: x+ 10)(10)

20

In [13]:
# lambda function with multiple arguments
func_2 = lambda x,y,z: x + y + 10 + z
func_2(2, 3, 5)


20

In [2]:
# exercise: turn the below function into a lambda function
def count_zeros(li):
    """
    return a count of how many zeros are in a list
    """
    count = sum(x == 0 for x in li)
    return count

In [3]:
count_zeros([1,2,4,0,0,0])

3

In [18]:
# solution 1 - hint, use map()
sum(map(lambda x: x == 0, [1, 2, 4, 0, 0, 0]))


3

In [33]:
# solution 2 - hint, use filter()
len(list(filter(lambda x: x == 0, [1, 2, 4, 0, 0, 0])))

3

## Part II. Data Cleaning in Pandas
You might wonder what lambda functions are useful for: they are incredibly handy for data cleaning in pandas. You can apply them to a single column or to the entire dataframe to get the results you need. For example, you might want to convert a column from USD to euros, or temperatures expressed in Celsius to Fahrenheit. You will learn three new functions:

- `apply()` - on both series and dataframes

- `applymap()` - only on dataframes

- `map()` - only on series
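
The three functions can be sketched on a small, made-up dataframe (the `temps` data and its column names are hypothetical, not part of the auto-mpg file). One caveat: `DataFrame.applymap` was renamed to `DataFrame.map` in pandas 2.1, so the sketch picks whichever one the installed version provides.

```python
import pandas as pd

# Hypothetical toy data, not the auto-mpg file
temps = pd.DataFrame({"city": ["Oslo", "Cairo"], "celsius": [0.0, 35.0]})

# Series.apply: run a function on every value of one column
temps["fahrenheit"] = temps["celsius"].apply(lambda c: c * 9 / 5 + 32)

# Series.map: same idea, and it also accepts a dict for lookups
temps["country"] = temps["city"].map({"Oslo": "Norway", "Cairo": "Egypt"})

# Element-wise function over a whole dataframe.
# Use DataFrame.map where it exists (pandas >= 2.1), else applymap.
numeric = temps[["celsius", "fahrenheit"]]
elementwise = numeric.map if hasattr(numeric, "map") else numeric.applymap
rounded = elementwise(lambda v: round(v))

print(temps)
print(rounded)
```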

In [19]:
# import the dataframe 
df = pd.read_csv('auto-mpg.csv')
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [20]:
# examine the first few rows of it 
df.head(5)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [27]:
# check the datatypes of the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg             398 non-null float64
cylinders       398 non-null int64
displacement    398 non-null float64
horsepower      398 non-null object
weight          398 non-null int64
acceleration    398 non-null float64
model year      398 non-null int64
origin          398 non-null int64
car name        398 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


In [34]:
# check the columns of the df
df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model year', 'origin', 'car name'],
      dtype='object')

In [35]:
# check whether you have missing values
df.isnull().any()

mpg             False
cylinders       False
displacement    False
horsepower      False
weight          False
acceleration    False
model year      False
origin          False
car name        False
dtype: bool
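
One caveat worth sketching: `isnull()` only flags real missing values (`NaN`/`None`), so a placeholder string passes unnoticed. A column that looks numeric but has dtype `object`, like `horsepower` above, may contain such placeholders. That is an assumption here, not something verified against the file. The Series below uses made-up values.

```python
import pandas as pd

# Made-up values; the "?" stands in for a missing reading
hp = pd.Series(["130", "165", "?", "150"])

# "?" is an ordinary string, so nothing is reported as missing
print(hp.isnull().any())

# Coerce to numbers; anything unparseable becomes NaN
hp_numeric = pd.to_numeric(hp, errors="coerce")
print(hp_numeric.isnull().sum())
```

After the conversion the placeholder shows up as a proper missing value and can be handled with the usual tools such as `dropna()` or `fillna()`.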

In [36]:
# creating new columns - show the broadcasting property of pandas
df['usable'] = 'Yes'
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes
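
The broadcasting behavior can be sketched on a toy dataframe (`cars` and its values are made up for illustration): a scalar is repeated for every row, whereas a list must match the number of rows exactly.

```python
import pandas as pd

cars = pd.DataFrame({"weight": [3504, 3693, 3436]})  # toy values

# A scalar is broadcast: every row gets the same value
cars["usable"] = "Yes"

# A list is not broadcast: its length must equal the number of rows
mismatch_raised = False
try:
    cars["flag"] = [True, False]  # 2 values for 3 rows
except ValueError:
    mismatch_raised = True

print(cars)
print(mismatch_raised)
```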


In [None]:
# check the dataframe
# df['new_column'] = df.old_column.apply(lambda function)

In [37]:
# time to use lambda and apply! with apply, applymap, and map, you never need to "iterate through the rows"

# create a new column called weight_in_tons by multiplying the `weight` column (in lb) by 0.0005

# 1 lb = 0.0005 short tons

df['weight_in_tons'] = df.weight.apply(lambda x: x * .0005)
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable,weight_in_tons
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes,1.752
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes,1.8465
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes,1.718
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes,1.7165
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes,1.7245
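
For simple arithmetic like this, plain column arithmetic is an alternative to `apply()` with a lambda. The sketch below, on a toy `cars` dataframe with made-up rows, computes the column both ways so the results can be compared; `apply()` remains the tool of choice when the logic does not fit a single arithmetic expression.

```python
import pandas as pd

cars = pd.DataFrame({"weight": [3504, 3693, 3436]})  # toy values

# Element by element, via apply and a lambda
via_apply = cars["weight"].apply(lambda x: x * 0.0005)

# Whole column at once, via broadcasting of the scalar
vectorized = cars["weight"] * 0.0005

# Largest absolute difference between the two results
max_diff = (via_apply - vectorized).abs().max()
print(max_diff)
```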


In [44]:
# exercise - create a new column called "years old", which determines how old a car is 

# if the car's model year is `70` (i.e. 1970), it would be 50 years old in 2020

df['years old'] = df['model year'].apply(lambda x: 2020 - (1900 + x))
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable,weight_in_tons,years old
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes,1.752,50
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes,1.8465,50
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes,1.718,50
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes,1.7165,50
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes,1.7245,50


#### What's Next?
- Pandas Groupby functions and aggregation
- Combining multiple dataframes

In [48]:
adult = pd.read_csv('adult.data', header = None)

In [54]:
# where's the key?! the file has no header row, so inspect the value counts of each column
for c in adult.columns:
    print(adult[c].value_counts())

36    898
31    888
34    886
23    877
35    876
     ... 
83      6
85      3
88      3
87      1
86      1
Name: 0, Length: 73, dtype: int64
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: 1, dtype: int64
164190    13
203488    13
123011    13
113364    12
121124    12
          ..
284211     1
312881     1
177711     1
179758     1
229376     1
Name: 2, Length: 21648, dtype: int64
 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: 3, dtype: int64
9     10501
10     72