In [8]:
import pandas as pd
import numpy as np
df = pd.read_csv('../Class 6/salaries.csv')

# Introduction to Data Science for Public Policy
## Class 7a: More advanced Pandas - Class 5's assignment
## Thomas Monk

First, let's go through Class 5's assignment - but this time with Pandas.

Open it again with me, and let's work through it together.

### Get columns - let's see what's there.
Returns the column labels. Note the data type it gives you: this is a pandas `Index`, not a list or a Series.

In [9]:
df.columns

Index(['Name', 'Job Titles', 'Department', 'Full or Part-Time',
       'Salary or Hourly', 'Typical Hours', 'Annual Salary', 'Hourly Rate'],
      dtype='object')

In [16]:
df['Typical Hours'].describe() # some useful descriptive statistics.

count    8022.000000
mean       34.507604
std         9.252077
min        10.000000
25%        20.000000
50%        40.000000
75%        40.000000
max        40.000000
Name: Typical Hours, dtype: float64

Question 1: How many employees does the city have?
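
Question 1 can be answered by counting rows. This is a minimal sketch, assuming each row of the file is one employee. The small `toy` frame is made-up data standing in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df: one row per employee.
toy = pd.DataFrame({
    'Name': ['A,  X', 'B,  Y', 'C,  Z'],
    'Department': ['POLICE', 'FIRE', 'POLICE'],
})

n_rows = len(toy)           # number of rows
n_rows_alt = toy.shape[0]   # same number, read from the (rows, columns) tuple
print(n_rows, n_rows_alt)
```

On the real data, `len(df)` would play the same role.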

### Null/missing values
#### Q2: How many full time employees are there?
Missing values are something we often came across in Stata. Any ideas for how to handle them here?

In [143]:
df['Typical Hours'].isnull() #How do we get the proportion?

0         True
1         True
2         True
3         True
4        False
         ...  
33178     True
33179     True
33180     True
33181     True
33182     True
Name: Typical Hours, Length: 33183, dtype: bool

Instead of summing, it can be more useful to take the mean, which
corresponds to the proportion of missing values:

In [18]:
df['Typical Hours'].isnull().mean()

0.7582497061748485
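
Why does the mean give a proportion? Booleans behave like 0 and 1, so the sum counts the missing values and the mean divides that count by the number of rows. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up hours: three missing values out of four.
hours = pd.Series([np.nan, np.nan, 20.0, np.nan])

n_missing = hours.isnull().sum()        # True counts as 1
share_missing = hours.isnull().mean()   # count divided by the number of rows
print(n_missing, share_missing)
```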

**Missing values** We can also run `isnull` on a whole DataFrame. The output is then a DataFrame of booleans.

In [19]:
df.isnull()

|  | Name | Job Titles | Department | Full or Part-Time | Salary or Hourly | Typical Hours | Annual Salary | Hourly Rate |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | True | False | True |
| 1 | False | False | False | False | False | True | False | True |
| 2 | False | False | False | False | False | True | False | True |
| 3 | False | False | False | False | False | True | False | True |
| 4 | False | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33178 | False | False | False | False | False | True | False | True |
| 33179 | False | False | False | False | False | True | False | True |
| 33180 | False | False | False | False | False | True | False | True |
| 33181 | False | False | False | False | False | True | False | True |


In [20]:
df.isnull().mean() # and obtain proportions for the entire DF.

Name                 0.00000
Job Titles           0.00000
Department           0.00000
Full or Part-Time    0.00000
Salary or Hourly     0.00000
Typical Hours        0.75825
Annual Salary        0.24175
Hourly Rate          0.75825
dtype: float64

**Imputation** We can fill missing values with `fillna`. (There are more sophisticated imputation methods that are worth looking at.)

In [141]:
df['Typical Hours'].fillna(40) #All missing hours are now full time work weeks.

0        40.0
1        40.0
2        40.0
3        40.0
4        20.0
         ... 
33178    40.0
33179    40.0
33180    40.0
33181    40.0
33182    40.0
Name: Typical Hours, Length: 33183, dtype: float64

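Would this change the DataFrame itself?

No. `fillna` returns a new Series, so we have to assign the result back with equals (`=`), for example `df['Typical Hours'] = df['Typical Hours'].fillna(40)`.

Many methods also accept an `inplace=True` argument, as in `fillna(40, inplace=True)`, which skips the assignment. Be careful when calling it on a selected column: in recent pandas versions this may not update the original DataFrame, so assigning back is the safer habit.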
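
To see that `fillna` on its own leaves the DataFrame untouched, here is a short sketch on a made-up frame (the `toy` name and its values are illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Typical Hours': [np.nan, 20.0, np.nan]})

filled = toy['Typical Hours'].fillna(40)              # returns a new Series
still_missing = toy['Typical Hours'].isnull().sum()   # the original column is unchanged

# Assigning the result back is what actually updates the DataFrame.
toy['Typical Hours'] = toy['Typical Hours'].fillna(40)
print(still_missing, toy['Typical Hours'].tolist())
```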

**Counting values** Given a Series, the `value_counts` method returns a new Series of frequencies (how often each
distinct value occurs), sorted from most to least common:

In [24]:
df.Department.value_counts().head(3) # Q4: How many employees are in the police department?
# Q6: How big is the largest department? Which one is it?

POLICE           13414
FIRE              4641
STREETS & SAN     2198
Name: Department, dtype: int64

In [28]:
# Q2: How many full time employees are there?
# Q3: How many part time employees are there?
df['Full or Part-Time'].value_counts() # Full or Part time

F    31090
P     2093
Name: Full or Part-Time, dtype: int64
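
Counts like these can also be expressed as shares, and the largest group can be read off directly. This is a sketch on made-up department labels: `normalize=True` and `idxmax` are standard pandas, but the data is hypothetical.

```python
import pandas as pd

# Made-up department labels standing in for df.Department.
dept = pd.Series(['POLICE', 'FIRE', 'POLICE', 'POLICE'], name='Department')

counts = dept.value_counts()                 # frequency of each label
shares = dept.value_counts(normalize=True)   # same, as proportions summing to 1
largest = counts.idxmax()                    # label with the highest count
print(largest, counts.max(), shares[largest])
```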

In [30]:
# Q5: How many detectives? - how could we find this out?

In [33]:
df[df.Department == 'POLICE']['Job Titles'].value_counts().head(3)

POLICE OFFICER                            9520
SERGEANT                                  1202
POLICE OFFICER (ASSIGNED AS DETECTIVE)     989
Name: Job Titles, dtype: int64

## Apply
Think about the following question:

Q7: How common is the most common first name? Which is it?

This was tough before. Why?

Remember how names are stored in the database. What do we want to do with them?

In [35]:
df['Name'].head()

0      AARON,  JEFFERY M
1        AARON,  KARINA 
2    AARON,  KIMBERLEI R
3    ABAD JR,  VICENTE M
4      ABASCAL,  REECE E
Name: Name, dtype: object

We just want the first name. Pandas has a powerful tool for this: `apply` lets us write a function and run it on every value in a column.

**Apply example**

In [87]:
display(df['Typical Hours'].head(5))
df['Typical Hours'].apply(np.sqrt).head(5) # What's the function I've used?

0     NaN
1     NaN
2     NaN
3     NaN
4    20.0
Name: Typical Hours, dtype: float64

0         NaN
1         NaN
2         NaN
3         NaN
4    4.472136
Name: Typical Hours, dtype: float64

**More interesting example**
What do we need our function to do with this input? AARON,  JEFFERY M.

Let's pause and write this together.

In [108]:
def get_first_name(name):
    first_middle = name.split(', ')[1]
    first = first_middle.strip().split(' ')[0]
    return first
display(df.Name.tail(5))
df.Name.tail(5).apply(get_first_name)

33178    ZYLINSKA,  KATARZYNA 
33179       ZYMANTAS,  LAURA C
33180        ZYMANTAS,  MARK E
33181      ZYRKOWSKI,  CARLO E
33182     ZYSKOWSKI,  DARIUSZ 
Name: Name, dtype: object

33178    KATARZYNA
33179        LAURA
33180         MARK
33181        CARLO
33182      DARIUSZ
Name: Name, dtype: object

We want to know the most common first name. What should we do?

In [113]:
df['First Name'] = df.Name.apply(get_first_name) #Make a new column called First Name

In [114]:
df['First Name'].value_counts().head(1) # What's the answer?

MICHAEL    1151
Name: First Name, dtype: int64
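
`get_first_name` assumes every name is a string containing a comma. If a name were missing or formatted differently, `apply` would raise an error. A defensive variant might look like the sketch below. The function name `get_first_name_safe` and the sample names are hypothetical.

```python
import numpy as np
import pandas as pd

def get_first_name_safe(name):
    """Return the first given name from 'LAST,  FIRST M', or NaN if it cannot be parsed."""
    if not isinstance(name, str) or ',' not in name:
        return np.nan
    given = name.split(',', 1)[1].strip()   # everything after the first comma
    return given.split(' ')[0] if given else np.nan

# Made-up inputs: one well-formed name, one without a comma, one missing.
names = pd.Series(['AARON,  JEFFERY M', 'NO COMMA HERE', None])
first = names.apply(get_first_name_safe)
print(first)
```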

**Working with strings** Another way to work with string columns is through the .str
attribute. For example:

In [118]:
(df['Job Titles'].str.find('DETECTIVE') >= 0).head(5)

0    False
1     True
2    False
3    False
4    False
Name: Job Titles, dtype: bool

How could we find how many detectives?

In [117]:
(df['Job Titles'].str.find('DETECTIVE') >= 0).sum() # find returns -1 when there is no match, so test >= 0

989
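
`str.find` returns the position of the match, or -1 when there is none, so a title that starts with the search term has position 0 and a `> 0` test would miss it. The sketch below uses made-up titles to show the difference, with `str.contains` as a more direct alternative.

```python
import pandas as pd

# Made-up titles: one starts with the search term, one contains it, one does not.
titles = pd.Series([
    'DETECTIVE',
    'POLICE OFFICER (ASSIGNED AS DETECTIVE)',
    'SERGEANT',
])

positions = titles.str.find('DETECTIVE')   # match position, or -1 if absent
n_gt = (positions > 0).sum()               # misses the match at position 0
n_ge = (positions >= 0).sum()              # counts every match
n_contains = titles.str.contains('DETECTIVE', regex=False).sum()
print(n_gt, n_ge, n_contains)
```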

**Salaries** We also needed to convert the salaries from a string such as `$19000` to a float (`19000.0`). How?

In [124]:
df['Num Salary'] = df['Annual Salary'].str[1:].astype(float)
df['Num Salary'].head(5) #What's the mean?

0    101442.0
1     94122.0
2    101592.0
3    110064.0
4         NaN
Name: Num Salary, dtype: float64
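
Slicing with `.str[1:]` works because every salary here starts with a single `$`. A more defensive sketch strips the symbols explicitly and lets `pd.to_numeric` turn anything unparseable into `NaN`. The thousands separator in the sample data is a hypothetical case and does not come from this file.

```python
import numpy as np
import pandas as pd

# Made-up salary strings; the comma is a hypothetical formatting variant.
raw = pd.Series(['$101442.00', '$94,122.00', np.nan])

cleaned = (raw.str.replace('$', '', regex=False)
              .str.replace(',', '', regex=False))
num = pd.to_numeric(cleaned, errors='coerce')   # unparseable values become NaN
print(num)
```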

In [126]:
df['Num Salary'].describe()

count     25161.000000
mean      86786.999790
std       21041.354602
min        7200.000000
25%       76266.000000
50%       90024.000000
75%       96060.000000
max      300000.000000
Name: Num Salary, dtype: float64

And let's sort to find the top.

In [127]:
df.sort_values('Num Salary', ascending=False)

|  | Name | Job Titles | Department | Full or Part-Time | Salary or Hourly | Typical Hours | Annual Salary | Hourly Rate | First Name | Num Salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 8439 | EVANS, GINGER S | COMMISSIONER OF AVIATION | AVIATION | F | Salary | NaN | $300000.00 | NaN | GINGER | 300000.0 |
| 14221 | JOHNSON, EDDIE T | SUPERINTENDENT OF POLICE | POLICE | F | Salary | NaN | $260004.00 | NaN | EDDIE | 260004.0 |
| 8198 | EMANUEL, RAHM | MAYOR | MAYOR'S OFFICE | F | Salary | NaN | $216210.00 | NaN | RAHM | 216210.0 |
| 26424 | SANTIAGO, JOSE A | FIRE COMMISSIONER | FIRE | F | Salary | NaN | $202728.00 | NaN | JOSE | 202728.0 |
| 9226 | FORD II, RICHARD C | FIRST DEPUTY FIRE COMMISSIONER | FIRE | F | Salary | NaN | $197736.00 | NaN | RICHARD | 197736.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33164 | ZUREK, FRANCIS | ELECTRICAL MECHANIC | OEMC | F | Hourly | 40.0 | NaN | $46.10 | FRANCIS | NaN |
| 33168 | ZWARYCZ MANN, IRENE A | CROSSING GUARD | OEMC | P | Hourly | 20.0 | NaN | $17.68 | IRENE | NaN |
| 33169 | ZWARYCZ, THOMAS J | POOL MOTOR TRUCK DRIVER | WATER MGMNT | F | Hourly | 40.0 | NaN | $35.60 | THOMAS | NaN |
| 33174 | ZYGADLO, JOHN P | MACHINIST (AUTOMOTIVE) | GENERAL SERVICES | F | Hourly | 40.0 | NaN | $46.35 | JOHN | NaN |


**Cutting data** We can ask Pandas to bin a numeric column into intervals for us, and use the result directly as a categorical variable.

In [133]:
pd.cut(df['Num Salary'], [0, 100000, 200000, 300000]) # Note: the list gives the bin edges, not the groups themselves

0        (100000.0, 200000.0]
1             (0.0, 100000.0]
2        (100000.0, 200000.0]
3        (100000.0, 200000.0]
4                         NaN
                 ...         
33178         (0.0, 100000.0]
33179         (0.0, 100000.0]
33180         (0.0, 100000.0]
33181         (0.0, 100000.0]
33182    (100000.0, 200000.0]
Name: Num Salary, Length: 33183, dtype: category
Categories (3, interval[int64, right]): [(0, 100000] < (100000, 200000] < (200000, 300000]]

In [137]:
pd.cut(df['Num Salary'], [0, 100000, 200000, 300000], labels=['low', 'medium', 'high']) # Or with labels as well!

0        medium
1           low
2        medium
3        medium
4           NaN
          ...  
33178       low
33179       low
33180       low
33181       low
33182    medium
Name: Num Salary, Length: 33183, dtype: category
Categories (3, object): ['low' < 'medium' < 'high']
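
Once the salaries are binned, the bands can be counted like any other column. This sketch uses made-up salaries. Note that the intervals are closed on the right by default, so a salary of exactly 100000 falls in the first band, and missing salaries stay missing.

```python
import numpy as np
import pandas as pd

# Made-up salaries, including a value exactly on a bin edge and a missing value.
salary = pd.Series([50000.0, 100000.0, 150000.0, 250000.0, np.nan])

bands = pd.cut(salary, [0, 100000, 200000, 300000],
               labels=['low', 'medium', 'high'])
band_counts = bands.value_counts(sort=False)   # NaN is excluded from the counts
print(band_counts)
```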