# Solutions

1. [DataFrame Attributes and Methods](#1.-DataFrame-Attributes-and-Methods)
1. [DataFrame Statistical Methods](#2.-DataFrame-Statistical-Methods)
1. [DataFrame Missing Value Methods](#3.-DataFrame-Missing-Value-Methods)
1. [DataFrame Sorting, Ranking, and Uniqueness](#4.-DataFrame-Sorting,-Ranking,-and-Uniqueness)
1. [DataFrame Structure Methods](#5.-DataFrame-Structure-Methods)
1. [DataFrame Methods More](#6.-DataFrame-Methods-More)
1. [Assigning Subsets of Data](#7.-Assigning-Subsets-of-Data)

In [1]:
import pandas as pd
college = pd.read_csv('../data/college.csv', index_col='instnm')

## 1. DataFrame Attributes and Methods

### Exercise 1
<span  style="color:green; font-size:16px">Select only the float columns from the `college` DataFrame. How many are there?</span>

In [2]:
college.select_dtypes('float').shape

(7535, 20)

There are 20 float columns: the second element of the shape tuple is the column count.

### Exercise 2
<span  style="color:green; font-size:16px">When you call the `info` method on a DataFrame, one of the last items in the output is the count of columns for each data type. Can you think of a different combination of pandas operations that would return this as a Series?</span>

In [3]:
college.dtypes.value_counts()

float64    20
object      4
int64       2
dtype: int64
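
The two answers above can be checked on a small frame where the expected counts are obvious. This is a minimal sketch on hypothetical toy data (the column names `name`, `score`, `weight`, and `count` are made up and are not in the college dataset).

```python
import pandas as pd

# Hypothetical toy data, not the college dataset
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'score': [1.5, 2.5, 3.5],
    'weight': [0.1, 0.2, 0.3],
    'count': [1, 2, 3],
})

# Keep only the float columns; shape[1] is the number of columns kept
float_cols = df.select_dtypes('float')
n_float = float_cols.shape[1]

# One entry per data type, valued by how many columns have that type
dtype_counts = df.dtypes.value_counts()
print(n_float)
print(dtype_counts)
```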

## 2. DataFrame Statistical Methods

In [4]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title')
cols = ['actor1_fb', 'actor2_fb', 'actor3_fb']
actor_fb = movie[cols]
actor_fb.head(3)

| title | actor1_fb | actor2_fb | actor3_fb |
|---|---|---|---|
| Avatar | 1000.0 | 936.0 | 855.0 |
| Pirates of the Caribbean: At World's End | 40000.0 | 5000.0 | 1000.0 |
| Spectre | 11000.0 | 393.0 | 161.0 |


### Exercise 1
<span  style="color:green; font-size:16px">Calculate the mean of each actor Facebook like column. Which actor (1, 2, or 3) has the highest mean?</span>

Actor 1 has the highest mean.

In [5]:
actor_fb.mean()

actor1_fb    6494.488491
actor2_fb    1621.923516
actor3_fb     631.276313
dtype: float64

### Exercise 2

<span  style="color:green; font-size:16px">The result of exercise 1 is a Series of three values. Can you call a method on this Series to choose the column name with the highest mean Facebook likes?</span>

In [6]:
actor_fb.mean().idxmax()

'actor1_fb'

### Exercise 3

<span  style="color:green; font-size:16px">Calculate the total Facebook likes of all three actors for each movie</span>

In [7]:
actor_fb.sum(axis=1).head()

title
Avatar                                         2791.0
Pirates of the Caribbean: At World's End      46000.0
Spectre                                       11554.0
The Dark Knight Rises                         73000.0
Star Wars: Episode VII - The Force Awakens      143.0
dtype: float64

### Exercise 4
<span  style="color:green; font-size:16px">What percentage of movies have more than 10,000 total actor FB likes?</span>

About 30%

In [8]:
(actor_fb.sum(axis='columns') > 10000).mean()

0.2982099267697315
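
Taking the mean of a boolean Series works because `True` counts as 1 and `False` as 0, so the mean is the share of `True` values. A minimal sketch using only the first five totals shown in Exercise 3 (so the resulting share is for those five rows, not the full dataset):

```python
import pandas as pd

# First five row totals from the Exercise 3 output
totals = pd.Series([2791.0, 46000.0, 11554.0, 73000.0, 143.0])

over_10k = totals > 10000      # boolean Series
count = over_10k.sum()         # number of True values
share = over_10k.mean()        # fraction of True values
print(count, share)
```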

### Exercise 5

<span  style="color:green; font-size:16px">Find the median gross revenue in millions of dollars for the movies that have more than 10,000 total actor FB likes. Do the same for movies with 10,000 or less total actor FB likes.</span>

In [9]:
filt = actor_fb.sum(axis='columns') > 10000
movie.loc[filt, 'gross'].median() / 1e6

42.3919155

In [10]:
movie.loc[~filt, 'gross'].median() / 1e6

16.8157525

### Exercise 6

<span  style="color:green; font-size:16px">From exercise 5, it appears that movies with more than 10,000 total actor FB likes gross 2.5 times as much. This may be due to the fact that newer movies have more actors that are recognized by FB users. Find the median year produced for both groups.</span>

In [11]:
movie.loc[filt, 'year'].median()

2006.0

In [12]:
movie.loc[~filt, 'year'].median()

2005.0

### Exercise 7

<span  style="color:green; font-size:16px">For each movie made in the year 2016, what is the median of the total actor FB likes?</span>

In [13]:
filt = movie['year'] == 2016
actor_fb[filt].sum(axis=1).median()

3571.5

If the above is too complex, here it is one step at a time.

In [14]:
actor_fb_2016 = actor_fb[filt]
actor_fb_2016.head()

| title | actor1_fb | actor2_fb | actor3_fb |
|---|---|---|---|
| Batman v Superman: Dawn of Justice | 15000.0 | 4000.0 | 2000.0 |
| Captain America: Civil War | 21000.0 | 19000.0 | 11000.0 |
| Star Trek Beyond | 998.0 | 119.0 | 105.0 |
| The Legend of Tarzan | 11000.0 | 10000.0 | 103.0 |
| X-Men: Apocalypse | 34000.0 | 13000.0 | 1000.0 |


In [15]:
actor_fb_2016_total = actor_fb_2016.sum(axis=1)
actor_fb_2016_total.head()

title
Batman v Superman: Dawn of Justice    21000.0
Captain America: Civil War            51000.0
Star Trek Beyond                       1222.0
The Legend of Tarzan                  21103.0
X-Men: Apocalypse                     48000.0
dtype: float64

In [16]:
actor_fb_2016_total.median()

3571.5

### Exercise 8

<span  style="color:green; font-size:16px">Write a function that has a single parameter, `year`. Have it return the median of the total actor FB likes for the given year. Test your function with the year 2016 and verify the result with Exercise 7.</span>

In [17]:
def median_fb_likes(year):
    filt = movie['year'] == year
    return actor_fb.loc[filt].sum(axis='columns').median()

In [18]:
median_fb_likes(2016)

3571.5

### Exercise 9

<span  style="color:green; font-size:16px">Write a loop to print out the year and median total actor FB likes for that year from 1990 to 2016</span>

In [19]:
for year in range(1990, 2017):
    print(year, median_fb_likes(year))

1990 2017.0
1991 2436.0
1992 2147.5
1993 2018.0
1994 2368.5
1995 2612.0
1996 2692.5
1997 1964.0
1998 2482.0
1999 2595.0
2000 2378.0
2001 2424.0
2002 2146.0
2003 2019.0
2004 2298.0
2005 2072.0
2006 2359.0
2007 2002.5
2008 2400.0
2009 2145.0
2010 2411.0
2011 2818.5
2012 2426.0
2013 2420.0
2014 2084.0
2015 2063.0
2016 3571.5
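
The loop filters the data once per year. As an alternative (not the approach the exercise asks for), the same per-year medians can likely be produced in one pass with `groupby`. A minimal sketch on hypothetical toy data with made-up values, assuming a `year` column and two like columns:

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({
    'year': [2015, 2015, 2016, 2016, 2016],
    'actor1_fb': [100.0, 300.0, 10.0, 20.0, 30.0],
    'actor2_fb': [1.0, 3.0, 1.0, 2.0, 3.0],
})

# Total likes per movie, then the median of those totals within each year
total = toy[['actor1_fb', 'actor2_fb']].sum(axis='columns')
median_by_year = total.groupby(toy['year']).median()
print(median_by_year)
```

The result is a Series indexed by year, so a range of years can be selected with `.loc` instead of looping.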


Use the college dataset with the institution name as the index for the remaining exercises.

In [20]:
college = pd.read_csv('../data/college.csv', index_col='instnm')

### Exercise 10

<span  style="color:green; font-size:16px">Using the **college** dataset, find the number of non-missing values in each column and again for each row.</span>

Non-missing count for each column:

In [21]:
college.count()

city                  7535
stabbr                7535
hbcu                  7164
menonly               7164
womenonly             7164
relaffil              7535
satvrmid              1185
satmtmid              1196
distanceonly          7164
ugds                  6874
ugds_white            6874
ugds_black            6874
ugds_hisp             6874
ugds_asian            6874
ugds_aian             6874
ugds_nhpi             6874
ugds_2mor             6874
ugds_nra              6874
ugds_unkn             6874
pptug_ef              6853
curroper              7535
pctpell               6849
pctfloan              6849
ug25abv               6718
md_earn_wne_p10       6413
grad_debt_mdn_supp    7503
dtype: int64

Non-missing count for each row:

In [22]:
college.count(axis='columns').head()

instnm
Alabama A & M University               26
University of Alabama at Birmingham    26
Amridge University                     24
University of Alabama in Huntsville    26
Alabama State University               26
dtype: int64

### Exercise 11

<span  style="color:green; font-size:16px">What is the average number of non-missing values for each row?</span>

In [23]:
college.count(axis=1).mean()

22.70763105507631

### Exercise 12

<span  style="color:green; font-size:16px">The `ugds` column of the college dataset contains the total undergraduate population. What is the least number of colleges it would take to have a total of more than 5 million students?</span>

In [24]:
ugds_cumsum = college['ugds'].sort_values(ascending=False).cumsum()
ugds_cumsum.head(10)

instnm
University of Phoenix-Arizona             151558.0
Ivy Tech Community College                229215.0
Miami Dade College                        290685.0
Lone Star College System                  350605.0
Houston Community College                 408689.0
University of Central Florida             460969.0
Liberty University                        510309.0
Texas A & M University-College Station    557250.0
American Public University System         602174.0
Ashford University                        646918.0
Name: ugds, dtype: float64

In [25]:
(ugds_cumsum < 5000000).sum() + 1

185

It takes the top 185 colleges (by population) to total more than 5 million students. Let's verify the results:

In [26]:
ugds_sort = college['ugds'].sort_values(ascending=False)

In [27]:
ugds_sort.iloc[:184].sum()

4989478.0

In [28]:
ugds_sort.iloc[:185].sum()

5007289.0
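
The cumulative-sum trick is easier to follow on a handful of values. A minimal sketch with hypothetical enrollments and a threshold of 100; it uses `<=` so that a running total exactly equal to the threshold does not count as "more than", and it assumes the grand total does exceed the threshold.

```python
import pandas as pd

# Hypothetical enrollments for five schools
sizes = pd.Series([50, 10, 40, 30, 20], index=list('abcde'))

# Largest first, then a running total: 50, 90, 120, 140, 150
running = sizes.sort_values(ascending=False).cumsum()

# Schools whose running total is still at or below 100, plus the one that crosses it
n_needed = (running <= 100).sum() + 1
print(n_needed)
```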

### Exercise 13

<span  style="color:green; font-size:16px">Call the `describe` method, but make it work only for the string columns.</span>

In [29]:
college.describe(include='object')

|  | city | stabbr | md_earn_wne_p10 | grad_debt_mdn_supp |
|---|---|---|---|---|
| count | 7535 | 7535 | 6413 | 7503 |
| unique | 2514 | 59 | 598 | 2038 |
| top | New York | CA | PrivacySuppressed | PrivacySuppressed |
| freq | 87 | 773 | 822 | 1510 |


### Exercise 14

<span  style="color:green; font-size:16px">Call the `max` method, but only return columns that are numeric.</span>

In [30]:
college.max(numeric_only=True)

hbcu                 1.0000
menonly              1.0000
womenonly            1.0000
relaffil             1.0000
satvrmid           765.0000
satmtmid           785.0000
distanceonly         1.0000
ugds            151558.0000
ugds_white           1.0000
ugds_black           1.0000
ugds_hisp            1.0000
ugds_asian           0.9727
ugds_aian            1.0000
ugds_nhpi            0.9983
ugds_2mor            0.5333
ugds_nra             0.9286
ugds_unkn            0.9027
pptug_ef             1.0000
curroper             1.0000
pctpell              1.0000
pctfloan             1.0000
ug25abv              1.0000
dtype: float64

## 3. DataFrame Missing Value Methods

In [31]:
import pandas as pd
college = pd.read_csv('../data/college.csv', index_col='instnm')

### Exercise 1

<span  style="color:green; font-size:16px">Find the number of missing values for each row.</span>

In [32]:
college.isna().sum(axis='columns').head(10)

instnm
Alabama A & M University               0
University of Alabama at Birmingham    0
Amridge University                     2
University of Alabama in Huntsville    0
Alabama State University               0
The University of Alabama              0
Central Alabama Community College      2
Athens State University                2
Auburn University at Montgomery        0
Auburn University                      0
dtype: int64

### Exercise 2

<span  style="color:green; font-size:16px">What percentage of rows have more than 5 missing values?</span>

In [33]:
(college.isna().sum(axis='columns') > 5).mean()

0.09011280690112806

### Exercise 3

<span  style="color:green; font-size:16px">How many total missing values are there in the entire DataFrame?</span>

In [34]:
college.isna().sum().sum()

24808

### Exercise 4

<span  style="color:green; font-size:16px">How many total non-missing values are there in the entire DataFrame?</span>

In [35]:
college.count().sum()

171102

### Exercise 5

<span  style="color:green; font-size:16px">How many rows will be dropped when the `dropna` method is called with its defaults? Calculate this number without calling the `dropna` method.</span>

In [36]:
# calculate the number of rows with at least one missing value
(college.isna().sum(axis=1) > 0).sum()

6364

### Exercise 6

<span  style="color:green; font-size:16px">Verify the result from exercise 5 by calling the `dropna` method.</span>

In [37]:
len(college) - len(college.dropna())

6364
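
An equivalent way to count rows with at least one missing value is `isna().any(axis='columns')`, which skips the intermediate per-row count. A minimal sketch on hypothetical toy data:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: rows 1, 2, and 3 each have at least one missing value
df = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                   'b': [1.0, 2.0, np.nan, np.nan]})

rows_with_missing = df.isna().any(axis='columns').sum()
rows_dropped = len(df) - len(df.dropna())
print(rows_with_missing, rows_dropped)
```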

### Exercise 7

<span  style="color:green; font-size:16px">Drop all the rows that are missing the `ugds` column.</span>

In [38]:
college.dropna(subset=['ugds']).head(3)

| instnm | city | stabbr | hbcu | menonly | womenonly | relaffil | satvrmid | satmtmid | distanceonly | ugds | ... | ugds_2mor | ugds_nra | ugds_unkn | pptug_ef | curroper | pctpell | pctfloan | ug25abv | md_earn_wne_p10 | grad_debt_mdn_supp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | Normal | AL | 1.0 | 0.0 | 0.0 | 0 | 424.0 | 420.0 | 0.0 | 4206.0 | ... | 0.0 | 0.0059 | 0.0138 | 0.0656 | 1 | 0.7356 | 0.8284 | 0.1049 | 30300 | 33888.0 |
| University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 | 0.0 | 0 | 570.0 | 565.0 | 0.0 | 11383.0 | ... | 0.0368 | 0.0179 | 0.01 | 0.2607 | 1 | 0.346 | 0.5214 | 0.2422 | 39700 | 21941.5 |
| Amridge University | Montgomery | AL | 0.0 | 0.0 | 0.0 | 1 | NaN | NaN | 1.0 | 291.0 | ... | 0.0 | 0.0 | 0.2715 | 0.4536 | 1 | 0.6801 | 0.7795 | 0.854 | 40100 | 23370.0 |


### Exercise 8

<span style="color:green; font-size:16px">Drop all columns that have more than 5% of their values missing.</span>

In [39]:
thresh = int(.95 * len(college))
thresh

7158

In [40]:
college.dropna(axis=1, thresh=thresh).shape

(7535, 9)

17 columns were dropped.

In [41]:
college.shape

(7535, 26)
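
The `thresh` parameter counts non-missing values, and `int` truncates, so the exact 5% boundary can be off by a row when `.95 * len(college)` is not a whole number. One way to make the condition explicit is to compute the missing fraction per column and select with a boolean mask. A minimal sketch on hypothetical toy data with 20 rows:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with 0%, 5%, and 10% missing values
df = pd.DataFrame({
    'full': range(20),
    'one_missing': [np.nan] + [1.0] * 19,
    'two_missing': [np.nan] * 2 + [1.0] * 18,
})

# Fraction of missing values in each column
missing_frac = df.isna().mean()

# Keep columns with at most 5% missing, i.e. drop those with more than 5%
kept = df.loc[:, missing_frac <= 0.05]
print(list(kept.columns))
```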

### Exercise 9

<span style="color:green; font-size:16px">Fill in the missing values with the maximum value of each column.</span>

In [42]:
max_vals = college.max()
max_vals.head()

city         Zanesville
stabbr               WY
hbcu                  1
menonly               1
womenonly             1
dtype: object

In [43]:
college.fillna(max_vals).head(3)

| instnm | city | stabbr | hbcu | menonly | womenonly | relaffil | satvrmid | satmtmid | distanceonly | ugds | ... | ugds_2mor | ugds_nra | ugds_unkn | pptug_ef | curroper | pctpell | pctfloan | ug25abv | md_earn_wne_p10 | grad_debt_mdn_supp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | Normal | AL | 1.0 | 0.0 | 0.0 | 0 | 424.0 | 420.0 | 0.0 | 4206.0 | ... | 0.0 | 0.0059 | 0.0138 | 0.0656 | 1 | 0.7356 | 0.8284 | 0.1049 | 30300 | 33888.0 |
| University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 | 0.0 | 0 | 570.0 | 565.0 | 0.0 | 11383.0 | ... | 0.0368 | 0.0179 | 0.01 | 0.2607 | 1 | 0.346 | 0.5214 | 0.2422 | 39700 | 21941.5 |
| Amridge University | Montgomery | AL | 0.0 | 0.0 | 0.0 | 1 | 765.0 | 785.0 | 1.0 | 291.0 | ... | 0.0 | 0.0 | 0.2715 | 0.4536 | 1 | 0.6801 | 0.7795 | 0.854 | 40100 | 23370.0 |
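
Passing a Series to `fillna` fills each column with the value stored under that column's name. Depending on the pandas version, calling `max` on object columns that mix strings and missing values may raise an error or silently drop those columns, so restricting to numeric columns is a safer variant. A minimal sketch on hypothetical toy data, assuming only the numeric columns need filling:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data with one string column and two numeric columns
df = pd.DataFrame({
    'city': ['Austin', None, 'Waco'],
    'sat': [500.0, np.nan, 650.0],
    'ugds': [np.nan, 2000.0, 1500.0],
})

# Maximum of each numeric column; 'city' is excluded
max_vals = df.max(numeric_only=True)

# Each numeric column is filled with its own maximum; 'city' is left as-is
filled = df.fillna(max_vals)
print(filled)
```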


## 4. DataFrame Sorting, Ranking, and Uniqueness

In [44]:
emp = pd.read_csv('../data/employee.csv')

### Exercise 1

<span  style="color:green; font-size:16px">Sort department, race, sex ascending along with salary descending.</span>

In [45]:
emp.sort_values(['dept', 'race', 'sex', 'salary'], ascending=[True, True, True, False]).head()

|  | dept | title | hire_date | salary | sex | race |
|---|---|---|---|---|---|---|
| 19946 | Fire | PHYSICIAN,MD | 2015-06-22 | 342784.0 | Female | Asian |
| 514 | Fire | ASSISTANT EMS PHYSICIAN DIRECTOR | 2017-08-28 | 141669.0 | Female | Asian |
| 8642 | Fire | STAFF PSYCHOLOGIST | 2014-12-22 | 103805.0 | Female | Asian |
| 7199 | Fire | SENIOR STAFF ANALYST | 2003-01-14 | 97850.0 | Female | Asian |
| 430 | Fire | ADMINISTRATIVE ASSISTANT | 1985-08-19 | 55432.0 | Female | Asian |


### Exercise 2

<span  style="color:green; font-size:16px">How many unique combinations of department and title exist?</span>

In [46]:
len(emp.drop_duplicates(subset=['dept', 'title']))

1312

### Exercise 3

<span  style="color:green; font-size:16px">Since only Series have a `unique` method, can you think of a creative way of getting the same result as exercise 2 with the `unique` method?</span>

In [47]:
# concatenate the two columns together to create a Series
dept_title = emp['dept'] + emp['title']
dept_title.head()

0                          PolicePOLICE SERGEANT
1                OtherASSISTANT CITY ATTORNEY II
2    Houston Public WorksSENIOR SLUDGE PROCESSOR
3                    PoliceSENIOR POLICE OFFICER
4                    PoliceSENIOR POLICE OFFICER
dtype: object

In [48]:
# Now, use the `unique` method.
len(dept_title.unique())

1312

In [49]:
# in one line
len((emp['dept'] + emp['title']).unique())

1312
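
Concatenating without a separator gives the right count here, but in general two different pairs can produce the same string. A minimal sketch on hypothetical toy data showing the collision and one possible guard, assuming the separator `'|'` never appears in either column:

```python
import pandas as pd

# Hypothetical toy data: ('ab', 'c') and ('a', 'bc') are different pairs
df = pd.DataFrame({'dept': ['ab', 'a', 'ab'],
                   'title': ['c', 'bc', 'c']})

# Reference answer: number of distinct (dept, title) pairs
n_pairs = len(df.drop_duplicates(subset=['dept', 'title']))

# Plain concatenation maps both pairs to 'abc'
n_plain = (df['dept'] + df['title']).nunique()

# A separator keeps the pairs distinct
n_sep = (df['dept'] + '|' + df['title']).nunique()
print(n_pairs, n_plain, n_sep)
```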

### Exercise 4

<span  style="color:green; font-size:16px">Find the occurrence of all race and sex combinations. For instance, you would return an object that contains the number of 'Hispanic Males', 'Black Females', etc...</span>

In [50]:
race_sex = emp['race'] + ' - ' + emp['sex']
race_sex.head()

0       White - Male
1    Hispanic - Male
2       Black - Male
3    Hispanic - Male
4       White - Male
dtype: object

In [51]:
race_sex.value_counts()

White - Male                6488
Black - Male                5074
Hispanic - Male             4208
Black - Female              3587
Hispanic - Female           1940
White - Female              1291
Asian - Male                1059
Asian - Female               488
Native American - Male       107
Native American - Female      39
dtype: int64

In [52]:
# in one line
# normalized to get relative frequency
(emp['race'] + ' - ' + emp['sex']).value_counts(normalize=True).round(3)

White - Male                0.267
Black - Male                0.209
Hispanic - Male             0.173
Black - Female              0.148
Hispanic - Female           0.080
White - Female              0.053
Asian - Male                0.044
Asian - Female              0.020
Native American - Male      0.004
Native American - Female    0.002
dtype: float64
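
String concatenation is one way to count combinations. Another option, not used in the solution above, is to group by both columns and take the size of each group, which keeps race and sex as separate index levels. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data, not the employee dataset
df = pd.DataFrame({'race': ['White', 'Black', 'White', 'Black', 'White'],
                   'sex': ['Male', 'Male', 'Male', 'Female', 'Female']})

# Number of rows for each (race, sex) combination
combo_counts = df.groupby(['race', 'sex']).size()
print(combo_counts)
```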

### Exercise 5

<span  style="color:green; font-size:16px">Read in the college dataset, setting the institution name (`instnm`) as the index. Select the columns `stabbr`, `satvrmid`, `satmtmid`, and `ugds` for schools in the state of Texas ('TX') that have an undergraduate student population of more than 20,000. Drop any rows with missing values and assign the result to the variable name `college_tx`.</span>

In [53]:
college = pd.read_csv('../data/college.csv', index_col='instnm')
cols = ['stabbr', 'satvrmid', 'satmtmid', 'ugds']
college_tx = college[cols].query('stabbr == "TX" and ugds > 20000').dropna()
college_tx

| instnm | stabbr | satvrmid | satmtmid | ugds |
|---|---|---|---|---|
| University of Houston | TX | 555.0 | 590.0 | 31643.0 |
| University of North Texas | TX | 545.0 | 555.0 | 29758.0 |
| Texas State University | TX | 510.0 | 515.0 | 32177.0 |
| Texas A & M University-College Station | TX | 580.0 | 615.0 | 46941.0 |
| The University of Texas at Arlington | TX | 494.0 | 550.0 | 29616.0 |
| The University of Texas at Austin | TX | 630.0 | 660.0 | 38914.0 |
| The University of Texas at San Antonio | TX | 505.0 | 535.0 | 23815.0 |
| Texas Tech University | TX | 540.0 | 560.0 | 28278.0 |


### Exercise 6

<span  style="color:green; font-size:16px">Rank each column from the `college_tx` DataFrame from greatest to least.</span>

In [54]:
college_tx.rank(ascending=False)

| instnm | stabbr | satvrmid | satmtmid | ugds |
|---|---|---|---|---|
| University of Houston | 4.5 | 3.0 | 3.0 | 4.0 |
| University of North Texas | 4.5 | 4.0 | 5.0 | 5.0 |
| Texas State University | 4.5 | 6.0 | 8.0 | 3.0 |
| Texas A & M University-College Station | 4.5 | 2.0 | 2.0 | 1.0 |
| The University of Texas at Arlington | 4.5 | 8.0 | 6.0 | 6.0 |
| The University of Texas at Austin | 4.5 | 1.0 | 1.0 | 2.0 |
| The University of Texas at San Antonio | 4.5 | 7.0 | 7.0 | 8.0 |
| Texas Tech University | 4.5 | 5.0 | 4.0 | 7.0 |


### Exercise 7

<span  style="color:green; font-size:16px">Using the full college dataset, find the largest school by population for each state. Return only the `stabbr` and `ugds` columns and have it sorted by `ugds`.</span>

In [55]:
college[['stabbr', 'ugds']].sort_values('ugds', ascending=False) \
       .drop_duplicates(subset='stabbr').head(10)

| instnm | stabbr | ugds |
|---|---|---|
| University of Phoenix-Arizona | AZ | 151558.0 |
| Ivy Tech Community College | IN | 77657.0 |
| Miami Dade College | FL | 61470.0 |
| Lone Star College System | TX | 59920.0 |
| Liberty University | VA | 49340.0 |
| American Public University System | WV | 44924.0 |
| Ashford University | CA | 44744.0 |
| Western Governors University | UT | 44499.0 |
| Ohio State University-Main Campus | OH | 43733.0 |
| Kaplan University-Davenport Campus | IA | 40335.0 |
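
The solution relies on `drop_duplicates` keeping the first occurrence of each state, which after a descending sort is that state's largest school. A minimal sketch on hypothetical toy data (the index labels `t1`, `t2`, and so on are made up):

```python
import pandas as pd

# Hypothetical toy data: two schools in TX, two in CA, one in AZ
df = pd.DataFrame({'stabbr': ['TX', 'TX', 'CA', 'CA', 'AZ'],
                   'ugds': [100.0, 300.0, 250.0, 50.0, 400.0]},
                  index=['t1', 't2', 'c1', 'c2', 'a1'])

# Sort largest first, then keep the first row seen for each state
largest = df.sort_values('ugds', ascending=False).drop_duplicates(subset='stabbr')
print(largest)
```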


### Exercise 8

<span  style="color:green; font-size:16px">Several of the columns from the college dataset contain binary data (are either 0 or 1). Can you identify the names of these columns?</span>

Start by finding the number of unique values of each column.

In [56]:
col_unique = college.nunique()
col_unique

city                  2514
stabbr                  59
hbcu                     2
menonly                  2
womenonly                2
relaffil                 2
satvrmid               163
satmtmid               167
distanceonly             2
ugds                  2932
ugds_white            4397
ugds_black            3242
ugds_hisp             2809
ugds_asian            1254
ugds_aian              601
ugds_nhpi              363
ugds_2mor              957
ugds_nra               920
ugds_unkn             1517
pptug_ef              3420
curroper                 2
pctpell               4422
pctfloan              4155
ug25abv               4285
md_earn_wne_p10        598
grad_debt_mdn_supp    2038
dtype: int64

Use boolean indexing to select only the columns with exactly two unique values.

In [57]:
col_unique[col_unique == 2]

hbcu            2
menonly         2
womenonly       2
relaffil        2
distanceonly    2
curroper        2
dtype: int64

Extract the index to get just the column names.

In [58]:
col_unique[col_unique == 2].index

Index(['hbcu', 'menonly', 'womenonly', 'relaffil', 'distanceonly', 'curroper'], dtype='object')
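
Having exactly two unique values does not by itself guarantee that those values are 0 and 1. A stricter check also confirms that every non-missing value is 0 or 1. A minimal sketch on hypothetical toy data (the columns `flag`, `two_vals`, and `size` are made up):

```python
import pandas as pd
import numpy as np

# Hypothetical toy data
df = pd.DataFrame({
    'flag': [0.0, 1.0, np.nan, 1.0],   # binary, with one missing value
    'two_vals': [5, 7, 5, 7],          # two unique values, but not 0/1
    'size': [10, 20, 30, 40],
})

# Columns with exactly two unique non-missing values
two_unique = df.nunique() == 2

# Columns whose non-missing values are all 0 or 1
zero_one = df.apply(lambda col: col.dropna().isin([0, 1]).all())

binary_cols = df.columns[two_unique & zero_one].tolist()
print(binary_cols)
```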

## 5. DataFrame Structure Methods

In [59]:
import pandas as pd
cols = ['instnm', 'city', 'stabbr', 'relaffil', 'satvrmid', 'satmtmid', 'ugds']
college_all = pd.read_csv('../data/college.csv', index_col='instnm', usecols=cols)
college_all.head()

| instnm | city | stabbr | relaffil | satvrmid | satmtmid | ugds |
|---|---|---|---|---|---|---|
| Alabama A & M University | Normal | AL | 0 | 424.0 | 420.0 | 4206.0 |
| University of Alabama at Birmingham | Birmingham | AL | 0 | 570.0 | 565.0 | 11383.0 |
| Amridge University | Montgomery | AL | 1 | NaN | NaN | 291.0 |
| University of Alabama in Huntsville | Huntsville | AL | 0 | 595.0 | 590.0 | 5451.0 |
| Alabama State University | Montgomery | AL | 0 | 425.0 | 430.0 | 4811.0 |


### Exercise 1

<span  style="color:green; font-size:16px">Create a new boolean column in the `college_all` DataFrame named 'Verbal Higher' that is True for every college that has a higher verbal than math SAT score. Find the mean of this new column. Why does this number look suspiciously low?</span>

In [60]:
college_all['Verbal Higher'] = college_all['satvrmid'] > college_all['satmtmid']

In [61]:
college_all['Verbal Higher'].mean()

0.048042468480424684

The main reason it is so low is that most values in the SAT columns are missing, and a comparison returns False whenever either operand is missing. Notice that about 84% of the values are missing in each SAT column.

In [62]:
cols = ['satvrmid', 'satmtmid']
college_all[cols].isna().mean()

satvrmid    0.842734
satmtmid    0.841274
dtype: float64
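
The effect of missing values on the comparison can be seen on four rows. A minimal sketch with hypothetical scores, where only the first row has both values present:

```python
import pandas as pd
import numpy as np

# Hypothetical scores; only row 0 has both values
verbal = pd.Series([500.0, np.nan, 600.0, np.nan])
math_sat = pd.Series([450.0, 520.0, np.nan, np.nan])

# Any comparison involving NaN evaluates to False
higher = verbal > math_sat
naive_share = higher.mean()               # 1 True out of 4 rows

# Restrict to rows where both scores are present
both_present = verbal.notna() & math_sat.notna()
real_share = higher[both_present].mean()  # 1 True out of 1 comparable row
print(naive_share, real_share)
```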

### Exercise 2

<span  style="color:green; font-size:16px">Find the real percentage of schools with higher verbal than math SAT scores.</span>

Drop the rows with missing SAT values first.

In [63]:
cols = ['satvrmid', 'satmtmid']
sat = college_all[cols].dropna()
sat.head()

| instnm | satvrmid | satmtmid |
|---|---|---|
| Alabama A & M University | 424.0 | 420.0 |
| University of Alabama at Birmingham | 570.0 | 565.0 |
| University of Alabama in Huntsville | 595.0 | 590.0 |
| Alabama State University | 425.0 | 430.0 |
| The University of Alabama | 555.0 | 565.0 |


In [64]:
(sat['satvrmid'] > sat['satmtmid']).mean()

0.30574324324324326

We can also find the share of schools with equal scores in both subjects.

In [65]:
(sat['satvrmid'] == sat['satmtmid']).mean()

0.08699324324324324

### Exercise 3

<span  style="color:green; font-size:16px">Create a new column called 'median all' that has every value set to the median population of all the schools.</span>

In [66]:
college_all['median all'] = college_all['ugds'].median()
college_all.head()

| instnm | city | stabbr | relaffil | satvrmid | satmtmid | ugds | Verbal Higher | median all |
|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | Normal | AL | 0 | 424.0 | 420.0 | 4206.0 | True | 412.5 |
| University of Alabama at Birmingham | Birmingham | AL | 0 | 570.0 | 565.0 | 11383.0 | True | 412.5 |
| Amridge University | Montgomery | AL | 1 | NaN | NaN | 291.0 | False | 412.5 |
| University of Alabama in Huntsville | Huntsville | AL | 0 | 595.0 | 590.0 | 5451.0 | True | 412.5 |
| Alabama State University | Montgomery | AL | 0 | 425.0 | 430.0 | 4811.0 | False | 412.5 |


### Exercise 4

<span  style="color:green; font-size:16px">Rename the row label 'Texas A &amp; M University-College Station' to 'TAMU'. Reassign the result back to `college_all` and then select this row as a Series.</span>

In [67]:
college_all = college_all.rename(index={'Texas A & M University-College Station':'TAMU'})
college_all.loc['TAMU']

city             College Station
stabbr                        TX
relaffil                       0
satvrmid                     580
satmtmid                     615
ugds                       46941
Verbal Higher              False
median all                 412.5
Name: TAMU, dtype: object

Execute the following cell to read in the City of Houston employee dataset and use it for the remaining problems.

In [68]:
emp = pd.read_csv('../data/employee.csv')
emp.head()

|  | dept | title | hire_date | salary | sex | race |
|---|---|---|---|---|---|---|
| 0 | Police | POLICE SERGEANT | 2001-12-03 | 87545.38 | Male | White |
| 1 | Other | ASSISTANT CITY ATTORNEY II | 2010-11-15 | 82182.0 | Male | Hispanic |
| 2 | Houston Public Works | SENIOR SLUDGE PROCESSOR | 2006-01-09 | 49275.0 | Male | Black |
| 3 | Police | SENIOR POLICE OFFICER | 1997-05-27 | 75942.1 | Male | Hispanic |
| 4 | Police | SENIOR POLICE OFFICER | 2006-01-23 | 69355.26 | Male | White |


### Exercise 5

<span  style="color:green; font-size:16px">Create a new column `bonus` right after the salary column equal to 10% of the salary. Round the bonus to the nearest thousand.</span>

In [69]:
bonus = (emp['salary'] * .1).round(-3)
emp.insert(4, 'bonus', bonus)
emp.head()

|  | dept | title | hire_date | salary | bonus | sex | race |
|---|---|---|---|---|---|---|---|
| 0 | Police | POLICE SERGEANT | 2001-12-03 | 87545.38 | 9000.0 | Male | White |
| 1 | Other | ASSISTANT CITY ATTORNEY II | 2010-11-15 | 82182.0 | 8000.0 | Male | Hispanic |
| 2 | Houston Public Works | SENIOR SLUDGE PROCESSOR | 2006-01-09 | 49275.0 | 5000.0 | Male | Black |
| 3 | Police | SENIOR POLICE OFFICER | 1997-05-27 | 75942.1 | 8000.0 | Male | Hispanic |
| 4 | Police | SENIOR POLICE OFFICER | 2006-01-23 | 69355.26 | 7000.0 | Male | White |
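
The hard-coded position `4` works only while `salary` stays the fourth column. One way to avoid counting columns by hand is to look up the position with `columns.get_loc`. A minimal sketch on a hypothetical three-column frame that reuses the first two salaries shown above:

```python
import pandas as pd

# Hypothetical reduced frame with the first two salaries from the output above
df = pd.DataFrame({'dept': ['Police', 'Other'],
                   'salary': [87545.38, 82182.0],
                   'sex': ['Male', 'Male']})

# 10% of salary, rounded to the nearest thousand
bonus = (df['salary'] * .1).round(-3)

# Integer position immediately to the right of 'salary'
loc = df.columns.get_loc('salary') + 1
df.insert(loc, 'bonus', bonus)   # insert modifies df in place
print(df)
```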


### Exercise 6

<span  style="color:green; font-size:16px">Read in the college dataset and set `instnm` as the index and assign it to the variable name `college1`. Use the `copy` method to create a new copy of the `college1` DataFrame and assign it to variable `college2`. Select all the non-white race columns (`ugds_black` through `ugds_unkn`). Sum the rows of this DataFrame and assign the result to a variable. Now drop all the non-white race columns from the `college2` DataFrame and assign the result to `college3`.</span>
    
<span  style="color:green; font-size:16px">Use the `insert` method to insert a new column to the right of the `ugds_white` column of the `college3` DataFrame. Name this column `ugds_nonwhite`.</span>

In [70]:
college1 = pd.read_csv('../data/college.csv', index_col='instnm')
college2 = college1.copy()
college2_race = college2.loc[:, 'ugds_black':'ugds_unkn']
college2_race.head()

| instnm | ugds_black | ugds_hisp | ugds_asian | ugds_aian | ugds_nhpi | ugds_2mor | ugds_nra | ugds_unkn |
|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | 0.9353 | 0.0055 | 0.0019 | 0.0024 | 0.0019 | 0.0 | 0.0059 | 0.0138 |
| University of Alabama at Birmingham | 0.26 | 0.0283 | 0.0518 | 0.0022 | 0.0007 | 0.0368 | 0.0179 | 0.01 |
| Amridge University | 0.4192 | 0.0069 | 0.0034 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2715 |
| University of Alabama in Huntsville | 0.1255 | 0.0382 | 0.0376 | 0.0143 | 0.0002 | 0.0172 | 0.0332 | 0.035 |
| Alabama State University | 0.9208 | 0.0121 | 0.0019 | 0.001 | 0.0006 | 0.0098 | 0.0243 | 0.0137 |


In [71]:
non_white = college2_race.sum(axis='columns')
non_white.head()

instnm
Alabama A & M University               0.9667
University of Alabama at Birmingham    0.4077
Amridge University                     0.7010
University of Alabama in Huntsville    0.3012
Alabama State University               0.9842
dtype: float64

In [72]:
college3 = college2.drop(columns=college2_race.columns)
college3.head(2)

| instnm | city | stabbr | hbcu | menonly | womenonly | relaffil | satvrmid | satmtmid | distanceonly | ugds | ugds_white | pptug_ef | curroper | pctpell | pctfloan | ug25abv | md_earn_wne_p10 | grad_debt_mdn_supp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | Normal | AL | 1.0 | 0.0 | 0.0 | 0 | 424.0 | 420.0 | 0.0 | 4206.0 | 0.0333 | 0.0656 | 1 | 0.7356 | 0.8284 | 0.1049 | 30300 | 33888.0 |
| University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 | 0.0 | 0 | 570.0 | 565.0 | 0.0 | 11383.0 | 0.5922 | 0.2607 | 1 | 0.346 | 0.5214 | 0.2422 | 39700 | 21941.5 |


In [73]:
college3.insert(11, 'ugds_nonwhite', non_white)

In [74]:
college3.head(2)

| instnm | city | stabbr | hbcu | menonly | womenonly | relaffil | satvrmid | satmtmid | distanceonly | ugds | ugds_white | ugds_nonwhite | pptug_ef | curroper | pctpell | pctfloan | ug25abv | md_earn_wne_p10 | grad_debt_mdn_supp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | Normal | AL | 1.0 | 0.0 | 0.0 | 0 | 424.0 | 420.0 | 0.0 | 4206.0 | 0.0333 | 0.9667 | 0.0656 | 1 | 0.7356 | 0.8284 | 0.1049 | 30300 | 33888.0 |
| University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 | 0.0 | 0 | 570.0 | 565.0 | 0.0 | 11383.0 | 0.5922 | 0.4077 | 0.2607 | 1 | 0.346 | 0.5214 | 0.2422 | 39700 | 21941.5 |


In [75]:
# all in one cell
college1 = pd.read_csv('../data/college.csv', index_col='instnm')
college2 = college1.copy()
college2_race = college2.loc[:, 'ugds_black':'ugds_unkn']
non_white = college2_race.sum(axis='columns')
college3 = college2.drop(columns=college2_race.columns)
college3.insert(11, 'ugds_nonwhite', non_white)
college3.head(2)

| instnm | city | stabbr | hbcu | menonly | womenonly | relaffil | satvrmid | satmtmid | distanceonly | ugds | ugds_white | ugds_nonwhite | pptug_ef | curroper | pctpell | pctfloan | ug25abv | md_earn_wne_p10 | grad_debt_mdn_supp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Alabama A & M University | Normal | AL | 1.0 | 0.0 | 0.0 | 0 | 424.0 | 420.0 | 0.0 | 4206.0 | 0.0333 | 0.9667 | 0.0656 | 1 | 0.7356 | 0.8284 | 0.1049 | 30300 | 33888.0 |
| University of Alabama at Birmingham | Birmingham | AL | 0.0 | 0.0 | 0.0 | 0 | 570.0 | 565.0 | 0.0 | 11383.0 | 0.5922 | 0.4077 | 0.2607 | 1 | 0.346 | 0.5214 | 0.2422 | 39700 | 21941.5 |


## 6. DataFrame Methods More

In [76]:
import pandas as pd
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

|  | dept | title | hire_date | salary | sex | race |
|---|---|---|---|---|---|---|
| 0 | Police | POLICE SERGEANT | 2001-12-03 | 87545.38 | Male | White |
| 1 | Other | ASSISTANT CITY ATTORNEY II | 2010-11-15 | 82182.0 | Male | Hispanic |
| 2 | Houston Public Works | SENIOR SLUDGE PROCESSOR | 2006-01-09 | 49275.0 | Male | Black |


### Exercise 1

<span  style="color:green; font-size:16px">Find the relative frequency of departments for all employees and then find the relative frequency of departments for the top 100 salaries. Compare the differences.</span>

In [77]:
dept_freq = emp['dept'].value_counts(normalize=True)
dept_freq.round(2)

Police                     0.31
Fire                       0.18
Houston Public Works       0.17
Other                      0.14
Health & Human Services    0.06
Houston Airport System     0.05
Parks & Recreation         0.05
Library                    0.02
Solid Waste Management     0.02
Name: dept, dtype: float64

In [78]:
emp_top100 = emp.nlargest(100, 'salary')
emp_top100.head()

Unnamed: 0,dept,title,hire_date,salary,sex,race
1732,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,White
1975,Fire,"PHYSICIAN,MD",2014-09-27,342784.0,Male,Asian
4680,Fire,"PHYSICIAN,MD",2015-11-23,342784.0,Male,White
4882,Fire,"PHYSICIAN,MD",2017-01-09,342784.0,Male,White
6501,Fire,"PHYSICIAN,MD",2016-05-31,342784.0,Male,White


In [79]:
dept_freq_top100 = emp_top100['dept'].value_counts(normalize=True)
dept_freq_top100

Other                      0.36
Fire                       0.22
Police                     0.15
Houston Airport System     0.09
Houston Public Works       0.09
Health & Human Services    0.07
Solid Waste Management     0.01
Library                    0.01
Name: dept, dtype: float64

In [80]:
# The police dept makes up 31% of employees
# but only 15% of the top 100 salaries (a drop of 16 percentage points)
dept_freq_top100 - dept_freq

Fire                       0.039977
Health & Human Services    0.014339
Houston Airport System     0.039975
Houston Public Works      -0.082371
Library                   -0.013161
Other                      0.221239
Parks & Recreation              NaN
Police                    -0.161544
Solid Waste Management    -0.011063
Name: dept, dtype: float64
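
The `NaN` for Parks & Recreation appears because that department has no employee in the top 100, so the subtraction finds no matching index label. A minimal sketch with small made-up Series (not the employee data) of how the `sub` method with `fill_value=0` treats a missing label as a frequency of zero:

```python
import pandas as pd

# Toy relative frequencies (made-up values, not the employee data)
all_freq = pd.Series({'Police': 0.5, 'Fire': 0.3, 'Parks': 0.2})
top_freq = pd.Series({'Police': 0.4, 'Fire': 0.6})

# The minus operator aligns on the index; 'Parks' is missing from top_freq,
# so its result is NaN
diff_nan = top_freq - all_freq

# sub with fill_value=0 substitutes 0 for the missing label before subtracting
diff_filled = top_freq.sub(all_freq, fill_value=0)
```

With `fill_value=0`, a department absent from the top 100 shows the full negative difference instead of a missing value.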

### Exercise 2

<span  style="color:green; font-size:16px">Find the day that each stock had its largest percentage one-day drop in price.</span>

In [81]:
stocks = pd.read_csv('../data/stocks/stocks10.csv', index_col='date', parse_dates=['date'])
stocks.head(3)

Unnamed: 0_level_0,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,


In [82]:
stocks.pct_change().idxmin()

MSFT   2000-04-24
AAPL   2000-09-29
SLB    2008-10-15
AMZN   2001-07-24
TSLA   2012-01-13
XOM    2008-10-15
WMT    2018-02-20
T      2000-12-19
FB     2018-07-26
V      2008-10-15
dtype: datetime64[ns]
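
`idxmin` returns only the date of the worst day, not the size of the drop. The sketch below uses made-up prices for two hypothetical tickers (`AAA` and `BBB`) and pairs each date with the corresponding percentage change from `min`:

```python
import pandas as pd

# Made-up closing prices for two hypothetical tickers
dates = pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-06'])
prices = pd.DataFrame({'AAA': [100.0, 90.0, 99.0, 49.5],
                       'BBB': [10.0, 8.0, 8.8, 8.36]}, index=dates)

# Day-over-day percentage change; the first row is NaN by construction
returns = prices.pct_change()

# Date of the most negative change and the change itself, one row per ticker
worst_day = returns.idxmin()
worst_drop = returns.min()
summary = pd.DataFrame({'date': worst_day, 'pct_change': worst_drop})
```

Both `idxmin` and `min` skip missing values by default, so the leading `NaN` row does not affect the result.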

## 7. Assigning Subsets of Data

Use the bikes dataset for all of the following exercises.

In [83]:
import pandas as pd
bikes = pd.read_csv('../data/bikes.csv')

### Exercise 1
<span  style="color:green; font-size:16px">Change the values of `events` to 'HEAT WAVE' for all rides where `temperature` is above 95. Verify this by outputting just the `events` and `temperature` columns that meet the condition.</span>

In [84]:
filt = bikes['temperature'] > 95
bikes.loc[filt, 'events'] = 'HEAT WAVE'
bikes.loc[filt, ['temperature', 'events']]

Unnamed: 0,temperature,events
395,96.1,HEAT WAVE
396,96.1,HEAT WAVE
397,96.1,HEAT WAVE


### Exercise 2
<span  style="color:green; font-size:16px">Increase the trip duration by 50% for all the rides that took place with a wind speed above 40. Output just the trip duration and wind speed before and after the assignment.</span>

In [85]:
filt = bikes['wind_speed'] > 40
cols = ['tripduration', 'wind_speed']
bikes.loc[filt, cols]

Unnamed: 0,tripduration,wind_speed
22306,130,42.6
22307,528,42.6
22308,358,42.6
22309,221,41.4


In [86]:
# Multiplying by 1.5 produces non-integer values, so the whole column becomes float
bikes.loc[filt, 'tripduration'] *= 1.5
bikes.loc[filt, cols]

Unnamed: 0,tripduration,wind_speed
22306,195.0,42.6
22307,792.0,42.6
22308,537.0,42.6
22309,331.5,41.4
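
Because `tripduration` starts out as an integer column, the assignment above relies on pandas upcasting it to float. Recent pandas releases may warn about this kind of implicit upcast. A minimal sketch, on a made-up frame named `rides`, that converts the column explicitly before scaling it:

```python
import pandas as pd

# Toy data; the first four rows reuse the durations shown above, the last is made up
rides = pd.DataFrame({'tripduration': [130, 528, 358, 221, 600],
                      'wind_speed': [42.6, 42.6, 42.6, 41.4, 10.0]})

# Convert to float first so the later assignment does not change the dtype
rides['tripduration'] = rides['tripduration'].astype('float64')

# Scale only the rows with wind speed above 40
windy = rides['wind_speed'] > 40
rides.loc[windy, 'tripduration'] *= 1.5
```

The explicit cast makes the dtype change a deliberate step rather than a side effect of the assignment.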


### Exercise 3
<span  style="color:green; font-size:16px">Change the trip duration for the first two rows to 0.</span>

In [87]:
# tripduration is the column at integer position 5
bikes.iloc[:2, 5] = 0
bikes.head(2)

Unnamed: 0,trip_id,usertype,gender,starttime,stoptime,tripduration,from_station_name,latitude_start,longitude_start,dpcapacity_start,to_station_name,latitude_end,longitude_end,dpcapacity_end,temperature,visibility,wind_speed,precipitation,events
0,7147,Subscriber,Male,2013-06-28 19:01:00,2013-06-28 19:17:00,0.0,Lake Shore Dr & Monroe St,41.88105,-87.61697,11.0,Michigan Ave & Oak St,41.90096,-87.623777,15.0,73.9,10.0,12.7,-9999.0,mostlycloudy
1,7524,Subscriber,Male,2013-06-28 22:53:00,2013-06-28 23:03:00,0.0,Clinton St & Washington Blvd,41.88338,-87.64117,31.0,Wells St & Walton St,41.89993,-87.63443,19.0,69.1,10.0,6.9,-9999.0,partlycloudy
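
Counting columns by hand to find position `5` is error-prone. As a sketch on a small made-up frame (the `rides` name and its values are illustrative), `columns.get_loc` can look up the integer position from the column label:

```python
import pandas as pd

# Toy frame with made-up values
rides = pd.DataFrame({'trip_id': [7147, 7524, 7968],
                      'tripduration': [993, 623, 1040],
                      'wind_speed': [12.7, 6.9, 16.1]})

# Look up the integer position of the column by its label
col_pos = rides.columns.get_loc('tripduration')

# Set the first two rows of that column to 0 using integer positions
rides.iloc[:2, col_pos] = 0
```

This keeps the positional row selection of `iloc` while avoiding a hard-coded column number.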
