# Pandas II - Data Cleaning

_October 29, 2020_

Agenda today:
- Introduction to lambda functions
- Introduction to data cleaning in pandas
- Combining DataFrames
- Optional Exercises

In [44]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Part I. Lambda function
Lambda functions are Python's anonymous functions. They let you write one-line functions inline and are often used together with `map()` and `filter()`.

Syntax of a lambda function: `lambda arguments: expression`.

In [45]:
# lambda function with one argument
lambda x:x+10
plus_ten = lambda x:x+10
# add 10 to any number
plus_ten(11)

# (lambda x:x+10)(10)

21

In [46]:
(lambda x:x+10)(5)

15

In [47]:
# lambda function with multiple arguments
(lambda x,y,z:x+y+z)(1,2,4)

7

In [48]:
# combine it with a conditional expression (the 'else' branch is required)
(lambda x:x+10 if x>10 else x)(25)

35

In [49]:
# use it with map and filter

# map(function, collection)
# the function is applied to every element
# of the collection

map(lambda x:x+10, [1,2,3])
list(map(lambda x:x+10, [1,2,3]))
# map() returns a lazy map object, not a list,
# so wrap it in list() to see the results

[11, 12, 13]
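In Python 3, `map()` returns a lazy iterator rather than a list, which is why the call above is wrapped in `list()`. A minimal sketch of that behaviour (the variable names are illustrative only):

```python
# map() builds a lazy iterator; nothing is computed until it is consumed
mapped = map(lambda x: x + 10, [1, 2, 3])

first = next(mapped)   # pulls only the first result
rest = list(mapped)    # consumes whatever is left

# a map object can be consumed only once, so a second pass finds nothing
leftover = list(mapped)

print(first, rest, leftover)
```

If the results are needed more than once, store the list rather than the map object.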

In [50]:
#filter(function, collection)

# returns a filter object that can be cast
# to a list containing only the items that match

# applies the function to every element in the
# collection and keeps only the items for which
# it returns a truthy value

# Note: list(filter(lambda x:x+10 if x > 5 else x, [1,6,4]))
# does not work as intended. filter() only keeps or drops
# items, it never transforms them, and every value this
# lambda returns is truthy, so the result is [1, 6, 4].

list(filter(lambda x:x>5, [1,6,4]))

[6]
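`filter()` uses the lambda's return value only to decide whether to keep an item; it never changes the item. That is why the conditional lambda in the note above does not add 10 to anything. One way to get "keep values above 5, then add 10" is to chain the two functions, sketched here with the same list:

```python
values = [1, 6, 4]

# the conditional lambda returns 1, 16, 4: all truthy, so every ORIGINAL item is kept
kept_everything = list(filter(lambda x: x + 10 if x > 5 else x, values))

# select with filter first, then transform with map
filtered_then_mapped = list(map(lambda x: x + 10, filter(lambda x: x > 5, values)))

# the same logic as a list comprehension
comprehension = [x + 10 for x in values if x > 5]

print(kept_everything)
print(filtered_then_mapped)
print(comprehension)
```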

In [51]:
# exercise: turn the below function into a lambda function
def count_zeros(li):
    """
    return a count of how many zeros are in a list
    """
    count = sum(x == 0 for x in li)
    return count

sum(map(lambda x:1 if x == 0 else 0, [1,0,0,1,0]))

3

In [52]:
list(filter(lambda x: x==0, [1,0,0,1,0]))

[0, 0, 0]

In [53]:
'''
len(filter(lambda x: x==0, [1,0,0,1,0]))
'''

# ^^^^ This does not work: a filter object has no len()
# also, sum(True) does not work (a single bool is not iterable),
# but sum([True]) works because True counts as 1
sum([True])

1
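Two possible workarounds for counting matches, given that a filter object has no length. This is a sketch using the same list as the exercise:

```python
numbers = [1, 0, 0, 1, 0]

# option 1: materialise the filter object as a list, then take its length
zeros_via_len = len(list(filter(lambda x: x == 0, numbers)))

# option 2: sum booleans directly (True counts as 1, False as 0)
zeros_via_sum = sum(x == 0 for x in numbers)

print(zeros_via_len, zeros_via_sum)
```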

In [54]:
from functools import reduce

def count_zeros(li):
    # fold over the list, adding 1 for every zero; the accumulator starts at 0
    print(reduce(lambda count, x: count + (x == 0), li, 0))

count_zeros([0,1,2,3,2,1,0,1,2,1,0,1,0])


4


## Part II. Data Cleaning in Pandas
You might wonder what lambda functions are used for. They are incredibly useful for data cleaning in pandas: you can apply them to a column or to an entire DataFrame to get the results you need. For example, you might want to convert a column from USD to euros, or temperatures from Celsius to Fahrenheit. You will learn three new methods:

- `apply()` - on both Series and DataFrames

- `applymap()` - only on DataFrames

- `map()` - only on Series
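A minimal sketch of the three methods on a toy DataFrame. The column names and values are made up for illustration. Newer pandas releases (2.1 and later) provide the element-wise DataFrame method as `DataFrame.map`, so the sketch picks whichever one is available:

```python
import pandas as pd

# toy data; the column names are hypothetical
toy = pd.DataFrame({"celsius": [0.0, 25.0, 100.0], "usd": [1.0, 2.0, 10.0]})

# Series.map: element-wise on one column
fahrenheit = toy["celsius"].map(lambda c: c * 9 / 5 + 32)

# Series.apply: also element-wise when given a plain Python function
doubled_usd = toy["usd"].apply(lambda d: d * 2)

# DataFrame.apply: the function receives a whole column (default) or a whole row (axis=1)
column_max = toy.apply(lambda col: col.max())

# element-wise over every cell: DataFrame.map in newer pandas, applymap in older versions
elementwise = toy.map if hasattr(pd.DataFrame, "map") else toy.applymap
plus_one = elementwise(lambda v: v + 1)

print(fahrenheit.tolist())
print(doubled_usd.tolist())
print(column_max.to_dict())
print(plus_one)
```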

In [55]:
# Difference between a Series and a DataFrame:
# a Series is 1-dimensional, a DataFrame is 2-dimensional

In [56]:
# import the dataframe 
df = pd.read_csv('auto-mpg.csv')
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [57]:
# examine the first few rows of it 
# Why is horsepower an object column?
# We want to turn horsepower into an int, but first
# investigate why it is being read
# as a string


In [60]:
# check the datatypes of the df
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower       object
weight            int64
acceleration    float64
model year        int64
origin            int64
car name         object
dtype: object

In [62]:
df.horsepower

0      130
1      165
2      150
3      150
4      140
      ... 
393     86
394     52
395     84
396     79
397     82
Name: horsepower, Length: 398, dtype: object

In [65]:
df.horsepower.value_counts().sort_values(ascending=False)

150    22
90     20
88     19
110    18
100    17
       ..
208     1
94      1
61      1
91      1
93      1
Name: horsepower, Length: 94, dtype: int64

In [66]:
# Still nothing suspicious...
# What if we just sort the values?


df.horsepower.sort_values()


# What are these "?"s at the end?

133    100
98     100
256    100
107    100
334    100
      ... 
126      ?
374      ?
354      ?
32       ?
336      ?
Name: horsepower, Length: 398, dtype: object

In [None]:
# the '?' entries force the entire Series to be stored as strings
# either remove those rows or set them to 0
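One way to find such placeholder values without scanning sorted output by eye is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a small made-up Series as a stand-in for the horsepower column, not the real data:

```python
import pandas as pd

# toy stand-in for the horsepower column; the values are made up
hp = pd.Series(["130", "165", "?", "150", "?"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
hp_numeric = pd.to_numeric(hp, errors="coerce")

# the entries that failed to parse are the ones to investigate
bad_values = list(hp[hp_numeric.isna()].drop_duplicates())

# option 1: drop them; option 2: set them to 0
hp_dropped = hp_numeric.dropna().astype(int)
hp_zeroed = hp_numeric.fillna(0).astype(int)

print(bad_values)
print(hp_dropped.tolist())
print(hp_zeroed.tolist())
```

Setting unknown values to 0 changes statistics such as the mean, so dropping the rows is often the safer choice.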

In [73]:
# keep only the rows where horsepower is not "?"
# (.copy() avoids a SettingWithCopyWarning on the assignment below)

df = df[df.horsepower != '?'].copy()

# convert the column to int
df['horsepower'] = df.horsepower.astype('int')



In [77]:
df[(df.horsepower < 150) & (df.weight > 3000)]['car name']

0              chevrolet chevelle malibu
4                            ford torino
34             plymouth satellite custom
35             chevrolet chevelle malibu
36                       ford torino 500
                     ...                
363                        buick century
364                oldsmobile cutlass ls
365                      ford granada gl
366               chrysler lebaron salon
387    oldsmobile cutlass ciera (diesel)
Name: car name, Length: 100, dtype: object

In [None]:
'''The .astype() method converts the
datatype of a specific column. Once horsepower
is numeric, the DataFrame can be subset with a
boolean condition: the query above keeps only
the rows where the condition is satisfied and
then selects the 'car name' column.'''

In [78]:
df = df.replace('?',0)

'''alternative approach: this replaces the
? marks with zero rather than throwing
those rows out'''

In [79]:
'''you can also replace missing values
with the median or the mean'''

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
...,...,...,...,...,...,...,...,...,...
393,27.0,4,140.0,86,2790,15.6,82,1,ford mustang gl
394,44.0,4,97.0,52,2130,24.6,82,2,vw pickup
395,32.0,4,135.0,84,2295,11.6,82,1,dodge rampage
396,28.0,4,120.0,79,2625,18.6,82,1,ford ranger
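The note above mentions filling missing values with the median or the mean. A minimal sketch of that idea on a made-up Series (the values are not from the auto-mpg file):

```python
import pandas as pd

# toy stand-in for the horsepower column; the values are made up
hp = pd.Series(["100", "?", "150", "350"])

# convert to numbers first so the placeholder becomes NaN
hp_numeric = pd.to_numeric(hp, errors="coerce")

# median() and mean() skip NaN by default
hp_median_filled = hp_numeric.fillna(hp_numeric.median())
hp_mean_filled = hp_numeric.fillna(hp_numeric.mean())

print(hp_median_filled.tolist())
print(hp_mean_filled.tolist())
```

The median is less sensitive to outliers than the mean, which is visible here: the single large value pulls the mean fill up to 200 while the median fill stays at 150.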


In [81]:
# check the datatypes of the df
# picking up from earlier
df.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower        int64
weight            int64
acceleration    float64
model year        int64
origin            int64
car name         object
dtype: object

In [82]:
# check whether you have missing values
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
car name        0
dtype: int64

In [83]:
# check whether you have missing values
df.isna().sum().any()

False

In [84]:
# creating new columns - broadcasting 
df['usable?'] = 'Yes'

In [85]:
# check the dataframe
df.head()

'''^^^^ This is broadcasting: the single
value 'Yes' is assigned to every row
of the new column'''

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable?
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes


In [None]:
'''create a new column that converts pounds
to tons. Use apply, map, or
broadcasting'''

In [8]:
# time to use lambda and apply! with apply, applymap, and map, you never need to "iterate through the rows"

# create a function that takes in the weight in lbs and returns the weight in tons 

# 1 lb = 0.0005 (short) tons
df['weight_in_tons'] = df['weight'].apply(lambda x:x*0.0005)
df.head()

'''this is how we use apply in
pandas: select the column we want to
transform, call its apply method, and
pass in the function to run on each
entry, here a lambda that multiplies
the value by 0.0005'''

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,weight_in_tons
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,1.752
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,1.8465
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,1.718
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,1.7165
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,1.7245


In [10]:
#broadcasting
df['weight_in_tons'] = df['weight'] * 0.0005
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,weight_in_tons
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,1.752
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,1.8465
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,1.718
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,1.7165
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,1.7245


In [None]:
''' ^^^ as we can see, computing weight in tons with
broadcasting gives exactly the same results as
computing it with apply

broadcasting is preferred: it is usually much faster than
apply or map, and apply or map are in turn
preferable to explicitly iterating through rows
with a for-loop. NO EXPLICIT ITERATION:
USE METHODS OR BROADCASTING'''
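The three approaches can be put side by side on a tiny made-up frame. This sketch only checks that they agree; to compare speed on real data, each line could be wrapped in `%timeit` in a notebook.

```python
import pandas as pd

# toy weights in pounds; the values are made up
cars = pd.DataFrame({"weight": [3504, 3693, 3436]})

# 1) explicit loop: works, but tends to be slow on large frames
loop_result = []
for w in cars["weight"]:
    loop_result.append(w * 0.0005)

# 2) apply: calls the lambda once per element
apply_result = cars["weight"].apply(lambda w: w * 0.0005)

# 3) broadcasting: one vectorised multiplication over the whole column
broadcast_result = cars["weight"] * 0.0005

# all three produce the same numbers
same = all(abs(a - b) < 1e-12 and abs(b - c) < 1e-12
           for a, b, c in zip(loop_result, apply_result, broadcast_result))
print(same)
```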

In [12]:
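# exercise - create a new column called
# "years old", which determines how old a
# car is 
# 'model year' is two-digit (70 means 1970), so in 2020
# the age is 2020 - (1900 + model year) = 120 - model year
df['years old'] = 120 - df['model year']
df.head()
'''can be solved with broadcasting or apply;
either way, subtract the model year from 120'''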




Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,weight_in_tons,years old
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,1.752,50
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,1.8465,50
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,1.718,50
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,1.7165,50
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,1.7245,50


## Part III. Combining DataFrames in Pandas
There are two functions in pandas that allow us to combine DataFrames:

    - df.merge() - matches rows of two DataFrames on either indices or columns
    - pd.concat() - concatenates DataFrames vertically or horizontally


In [86]:
#use df.merge not df.join


In [87]:
# create some toy dataframes 
small_grades = pd.DataFrame({"students":["Sandra","Billy","Alan"],
                          "projects":[1,2,1],
                          "grades":np.random.randint(80,100,3)})
small_quiz = pd.DataFrame({"students":["Alan","Steven","Davida"],
                            "quiz_score":np.random.randint(0,10,3)})

In [88]:
print(small_grades)
print(small_quiz)

  students  projects  grades
0   Sandra         1      93
1    Billy         2      84
2     Alan         1      93
  students  quiz_score
0     Alan           4
1   Steven           4
2   Davida           6


In [89]:
### pd.concat 
combined = pd.concat(
    [small_grades, small_quiz], axis = 0, sort = False
)

# axis=0 stacks the rows; axis=1 combines on columns (side by side)

In [None]:
combined
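Since the cell above shows no output, here is a sketch of what the two `axis` settings produce. It uses fixed toy values instead of the random grades so the result is reproducible. The frame names are illustrative.

```python
import pandas as pd

# fixed toy values (the original frames use random numbers)
grades = pd.DataFrame({"students": ["Sandra", "Billy", "Alan"], "grades": [93, 84, 93]})
quiz = pd.DataFrame({"students": ["Alan", "Steven", "Davida"], "quiz_score": [4, 4, 6]})

# axis=0 stacks rows; columns missing from one frame are filled with NaN
stacked = pd.concat([grades, quiz], axis=0, ignore_index=True, sort=False)

# axis=1 places the frames side by side, aligned on the row index (not on student name)
side_by_side = pd.concat([grades, quiz], axis=1)

print(stacked.shape)
print(side_by_side.shape)
print(stacked)
```

Note that neither form matches rows by student name. `concat` only aligns on the index, which is why `merge` is the better tool when rows must be matched on a key column.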

In [None]:
'''data science interviews ask about joins'''

<img src = 'sql-joins.png' width = 400>

Based on the diagram above, how do the different types of merge differ?

In [94]:
### df.merge

# inner merge (the default when `how` is not given)

small_grades.merge(small_quiz, on = 'students')

Unnamed: 0,students,projects,grades,quiz_score
0,Alan,1,93,4


In [92]:
# outer merge
small_grades.merge(small_quiz, how = 'outer', on = 'students')

Unnamed: 0,students,projects,grades,quiz_score
0,Sandra,1.0,93.0,
1,Billy,2.0,84.0,
2,Alan,1.0,93.0,4.0
3,Steven,,,4.0
4,Davida,,,6.0


In [90]:
# right merge
small_grades.merge(small_quiz, how = 'right', on = 'students')

Unnamed: 0,students,projects,grades,quiz_score
0,Alan,1.0,93.0,4
1,Steven,,,4
2,Davida,,,6
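The left merge is the one join type not shown above. The sketch below adds it, again with fixed toy values, and uses `indicator=True` to show where each row of an outer merge came from:

```python
import pandas as pd

grades = pd.DataFrame({"students": ["Sandra", "Billy", "Alan"], "grades": [93, 84, 93]})
quiz = pd.DataFrame({"students": ["Alan", "Steven", "Davida"], "quiz_score": [4, 4, 6]})

# left merge: keep every row of the left frame, fill unmatched quiz scores with NaN
left = grades.merge(quiz, how="left", on="students")

# indicator=True adds a _merge column: left_only, right_only, or both
outer = grades.merge(quiz, how="outer", on="students", indicator=True)

# number of rows produced by each join type
counts = {how: len(grades.merge(quiz, how=how, on="students"))
          for how in ["inner", "left", "right", "outer"]}

print(left)
print(outer[["students", "_merge"]])
print(counts)
```

Only Alan appears in both frames, so the inner merge has one row and the outer merge has five.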


In [96]:
# ^^^ This works because 'students' is
# a column in both DataFrames. If the key columns
# had different names, we would pass both:
# small_grades.merge(small_quiz, how = 'right', left_on = 'students', right_on = <key column in small_quiz>)

# We will be using joins/merges in our project

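When the key columns have different names, `left_on` and `right_on` are passed together. A sketch with a hypothetical variant of the quiz table whose key column is called `name` (that column name is an assumption for illustration):

```python
import pandas as pd

grades = pd.DataFrame({"students": ["Sandra", "Billy", "Alan"], "grades": [93, 84, 93]})
# hypothetical variant of the quiz table: the key column is called "name"
quiz = pd.DataFrame({"name": ["Alan", "Steven", "Davida"], "quiz_score": [4, 4, 6]})

# left_on names the key in the left frame, right_on the key in the right frame
merged = grades.merge(quiz, how="right", left_on="students", right_on="name")

print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.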
### Data Cleaning - level up with the adult dataset

In [None]:
'''The dataset is called "adult" but is really
about income: predicting whether a person makes
more or less than 50K

goal: show the distribution of education level and how many
people made more/less than 50K at each education level'''

Dataset documentation:
- age: continuous.
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- fnlwgt: continuous.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- education-num: continuous.
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- sex: Female, Male.
- capital-gain: continuous.
- capital-loss: continuous.
- hours-per-week: continuous.
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.



In [78]:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
adults = pd.read_csv(url,header = None)

In [79]:
# note: `a` is not a DataFrame method, so this call fails
# (it looks like an unfinished `apply`)
adults.a(lambda x: x)

AttributeError: 'DataFrame' object has no attribute 'a'

In [80]:
# how do I include the headers and perform each of the exercises below? (optional!)

In [82]:
# Check the first few rows 
# adults[adults['sex'] == 'Female']

In [94]:
# assign the column names to the dataset
adults.columns = ['age', 'workclass', 'fnlwgt', 'education',
           'education-num', 'marital_status','occupation',
           'relationship', 'race', 'sex', 'capital_gain',
           'capital_loss', 'hours_per_week', 'native_country',
           'income']

adults.sample(30)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
11959,59,Private,261232,11th,7,Divorced,Farming-fishing,Not-in-family,White,Male,0,0,40,United-States,<=50K
11588,45,Private,117556,Some-college,10,Divorced,Adm-clerical,Unmarried,Black,Female,0,0,32,United-States,<=50K
29325,20,?,133515,Some-college,10,Never-married,?,Own-child,White,Female,0,0,15,France,<=50K
9395,31,Private,210008,HS-grad,9,Never-married,Sales,Own-child,White,Female,0,0,40,United-States,<=50K
1126,35,Private,89508,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,50,United-States,>50K
31614,49,Private,237920,Doctorate,16,Married-civ-spouse,Sales,Husband,White,Male,0,0,60,United-States,<=50K
27782,41,Private,222504,Prof-school,15,Divorced,Prof-specialty,Unmarried,White,Female,0,0,38,United-States,<=50K
22325,61,Self-emp-not-inc,315977,HS-grad,9,Married-civ-spouse,Transport-moving,Husband,Black,Male,0,0,40,United-States,<=50K
1989,30,Private,181992,Some-college,10,Never-married,Sales,Not-in-family,Black,Female,0,0,35,United-States,<=50K
14569,31,Private,200117,HS-grad,9,Divorced,Adm-clerical,Not-in-family,White,Male,0,0,40,United-States,<=50K


In [84]:
# check the info of dataset
adults.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
income            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [85]:
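# check missing values
# adults['?'] looks up a column named '?', which does not exist, hence the KeyError;
# missing values are stored inside the cells as the string ' ?' (note the leading space)
adults['?']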

KeyError: '?'
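Because the placeholder lives inside the cells, it has to be counted per column rather than looked up as a column name. A sketch on made-up rows that mimic the raw file's leading spaces (the rows are not taken from the dataset):

```python
import pandas as pd

# toy rows mimicking the raw file, where text values carry a leading space
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "occupation": [" Sales", " ?", " ?"],
    "age": [39, 20, 50],
})

# count placeholder cells per text column, comparing after stripping whitespace
text_cols = ["workclass", "occupation"]
placeholder_counts = {col: int((toy[col].str.strip() == "?").sum()) for col in text_cols}

# optionally turn the placeholders into real missing values so isna() can see them
cleaned = toy.copy()
for col in text_cols:
    stripped = cleaned[col].str.strip()
    cleaned[col] = stripped.where(stripped != "?")  # non-matching cells become NaN

print(placeholder_counts)
print(cleaned.isna().sum().to_dict())
```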

In [105]:
# perform some queries - find the total num female & bachelors
# adults.groupby(['sex']).count()
# adults.groupby(['sex', 'marital_status'])
adults.groupby(['marital_status'])['sex'].value_counts()

marital_status          sex    
 Divorced                Female     2672
                         Male       1771
 Married-AF-spouse       Female       14
                         Male          9
 Married-civ-spouse      Male      13319
                         Female     1657
 Married-spouse-absent   Male        213
                         Female      205
 Never-married           Male       5916
                         Female     4767
 Separated               Female      631
                         Male        394
 Widowed                 Female      825
                         Male        168
Name: sex, dtype: int64

In [None]:
# seems like we have a data anomaly; find out what it is and fix it
# hint - tons of entries contain white space, remove it!
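One possible way to follow the hint, sketched on made-up rows: strip whitespace from every text column and leave numeric columns alone.

```python
import pandas as pd

# toy rows with the same leading-space problem as the raw file (made-up data)
toy = pd.DataFrame({"sex": [" Female", " Male"], "income": [" <=50K", " >50K"], "age": [39, 50]})

# the leading space makes an exact comparison fail
matches_before = int((toy["sex"] == "Female").sum())

# strip whitespace from every text column; numeric columns are skipped
for col in toy.columns:
    if pd.api.types.is_string_dtype(toy[col]):
        toy[col] = toy[col].str.strip()

matches_after = int((toy["sex"] == "Female").sum())

print(matches_before, matches_after)
```

An alternative is to avoid the problem at load time: `pd.read_csv` accepts `skipinitialspace=True`, which skips spaces that follow the delimiter.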

In [None]:
# subsetting multiple cols
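A short sketch of subsetting several columns at once, on made-up rows that are assumed to be already stripped of whitespace. Passing a list of names inside the brackets returns a DataFrame, and `.loc` combines a row condition with a column list.

```python
import pandas as pd

# toy rows; the values are made up
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "sex": ["Female", "Male", "Female"],
    "education": ["Bachelors", "HS-grad", "Bachelors"],
})

# a list of column names returns a DataFrame with just those columns
subset = toy[["age", "education"]]

# .loc takes a boolean row mask and a column list together
is_female_bachelor = (toy["sex"] == "Female") & (toy["education"] == "Bachelors")
female_bachelors = toy.loc[is_female_bachelor, ["age", "education"]]

print(subset.columns.tolist())
print(len(female_bachelors))
```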


In [98]:
# create a column called income_binary, 1 if >50K and 0 otherwise
# (assign with =, not ==; strip the leading whitespace and match the capital K)
adults['income_binary'] = adults['income'].str.strip().map(lambda x: 1 if x == '>50K' else 0)

In [None]:
# get some descriptive statistics of the income distribution 
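A sketch of two summaries that could answer this, using made-up rows and assuming an `income_binary` column like the one described above already exists:

```python
import pandas as pd

# toy rows; values are made up and already stripped of whitespace
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "HS-grad", "HS-grad"],
    "income": [">50K", "<=50K", "<=50K", "<=50K"],
})
toy["income_binary"] = (toy["income"] == ">50K").astype(int)

# share of rows in each income class
income_share = toy["income"].value_counts(normalize=True)

# share of >50K earners within each education level (the mean of a 0/1 column)
share_by_education = toy.groupby("education")["income_binary"].mean()

print(income_share.to_dict())
print(share_by_education.to_dict())
```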