# Solutions
This notebook contains solutions to all the exercises in the *Selecting Subsets of Data* series.

1. [Part 1: Selection with [], .loc and .iloc](#Part-1:-Selection-with-[],-.loc-and-.iloc)
1. [Part 2: Boolean Indexing](#Part-2:-Boolean-Indexing)
1. [Part 3: Assigning subsets of data](#Part-3:-Assigning-subsets-of-data)
1. [Part 4: How NOT to select subsets of data](#Part-4:-How-NOT-to-select-subsets-of-data)

In [1]:
import pandas as pd
import numpy as np

# Part 1: Selection with `[]`, `.loc` and `.iloc`

In [2]:
df = pd.read_csv('../../data/food_inspections.csv')
df.head()

Unnamed: 0,DBA Name,Facility Type,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations
0,DANY'S TACOS,Restaurant,Risk 1 (High),2857 S ST LOUIS AVE,60623.0,03/27/2017,License,Fail,"16. FOOD PROTECTED DURING STORAGE, PREPARATION..."
1,BILLY FOOD MARKET INC,,Risk 3 (Low),3906 W ROOSEVELT RD,60624.0,03/27/2017,License,Not Ready,
2,TAQUERIA HACIENDA TAPATIA,Restaurant,Risk 1 (High),4125 W 26TH ST,60623.0,03/27/2017,License Re-Inspection,Pass,2. FACILITIES TO MAINTAIN PROPER TEMPERATURE -...
3,WILD GOOSE BAR & GRILL,Restaurant,Risk 1 (High),4265 N LINCOLN AVE,60618.0,03/27/2017,Canvass,Fail,"16. FOOD PROTECTED DURING STORAGE, PREPARATION..."
4,PUBLICAN TAVERN K1,Restaurant,Risk 1 (High),11601 W TOUHY AVE,60666.0,03/27/2017,Canvass,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...


### Exercise 1
<span  style="color:green; font-size:16px">The current DataFrame has a simple `RangeIndex`. Let's make the **`DBA Name`** column the index to make it more meaningful. Save the result to variable **`df`**.</span>

In [3]:
df = df.set_index('DBA Name')
df.head()

DBA Name,Facility Type,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations
DANY'S TACOS,Restaurant,Risk 1 (High),2857 S ST LOUIS AVE,60623.0,03/27/2017,License,Fail,"16. FOOD PROTECTED DURING STORAGE, PREPARATION..."
BILLY FOOD MARKET INC,,Risk 3 (Low),3906 W ROOSEVELT RD,60624.0,03/27/2017,License,Not Ready,
TAQUERIA HACIENDA TAPATIA,Restaurant,Risk 1 (High),4125 W 26TH ST,60623.0,03/27/2017,License Re-Inspection,Pass,2. FACILITIES TO MAINTAIN PROPER TEMPERATURE -...
WILD GOOSE BAR & GRILL,Restaurant,Risk 1 (High),4265 N LINCOLN AVE,60618.0,03/27/2017,Canvass,Fail,"16. FOOD PROTECTED DURING STORAGE, PREPARATION..."
PUBLICAN TAVERN K1,Restaurant,Risk 1 (High),11601 W TOUHY AVE,60666.0,03/27/2017,Canvass,Fail,18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN...


### Exercise 2
<span  style="color:green; font-size:16px">Select the **`Risk`** column as a Series with just the indexing operator. Also select it with attribute access.</span>

In [4]:
df['Risk'].head()

DBA Name
DANY'S TACOS                 Risk 1 (High)
BILLY FOOD MARKET INC         Risk 3 (Low)
TAQUERIA HACIENDA TAPATIA    Risk 1 (High)
WILD GOOSE BAR & GRILL       Risk 1 (High)
PUBLICAN TAVERN K1           Risk 1 (High)
Name: Risk, dtype: object

In [5]:
df.Risk.head()

DBA Name
DANY'S TACOS                 Risk 1 (High)
BILLY FOOD MARKET INC         Risk 3 (Low)
TAQUERIA HACIENDA TAPATIA    Risk 1 (High)
WILD GOOSE BAR & GRILL       Risk 1 (High)
PUBLICAN TAVERN K1           Risk 1 (High)
Name: Risk, dtype: object

### Exercise 3
<span  style="color:green; font-size:16px">Select the **`Risk`** and **`Results`** columns</span>

In [6]:
df[['Risk', 'Results']].head()

DBA Name,Risk,Results
DANY'S TACOS,Risk 1 (High),Fail
BILLY FOOD MARKET INC,Risk 3 (Low),Not Ready
TAQUERIA HACIENDA TAPATIA,Risk 1 (High),Pass
WILD GOOSE BAR & GRILL,Risk 1 (High),Fail
PUBLICAN TAVERN K1,Risk 1 (High),Fail


### Exercise 4
<span  style="color:green; font-size:16px">Select a single column as a DataFrame</span>

In [7]:
df[['Zip']].head()

DBA Name,Zip
DANY'S TACOS,60623.0
BILLY FOOD MARKET INC,60624.0
TAQUERIA HACIENDA TAPATIA,60623.0
WILD GOOSE BAR & GRILL,60618.0
PUBLICAN TAVERN K1,60666.0


### Exercise 5
<span  style="color:green; font-size:16px">Select the row for the restaurant **`WILD GOOSE BAR & GRILL`**. What object is returned?</span>

In [8]:
df.loc['WILD GOOSE BAR & GRILL'] # returns a Series

Facility Type                                             Restaurant
Risk                                                   Risk 1 (High)
Address                                          4265 N LINCOLN AVE 
Zip                                                            60618
Inspection Date                                           03/27/2017
Inspection Type                                              Canvass
Results                                                         Fail
Violations         16. FOOD PROTECTED DURING STORAGE, PREPARATION...
Name: WILD GOOSE BAR & GRILL, dtype: object
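The object returned by **`.loc`** depends on how the label is passed and on whether it is unique in the index. The sketch below uses a small made-up DataFrame (the names `toy`, `CAFE A` and `CAFE B` are hypothetical, not from the inspections data) to show the three cases.

```python
import pandas as pd

# Hypothetical toy data; 'CAFE A' appears twice in the index on purpose
toy = pd.DataFrame(
    {'Risk': ['Risk 1 (High)', 'Risk 3 (Low)', 'Risk 2 (Medium)'],
     'Results': ['Fail', 'Pass', 'Pass']},
    index=['CAFE A', 'CAFE B', 'CAFE A'])

unique_row = toy.loc['CAFE B']     # single, unique label -> Series
listed_row = toy.loc[['CAFE B']]   # list with one label -> one-row DataFrame
repeated = toy.loc['CAFE A']       # label that occurs twice -> two-row DataFrame

print(type(unique_row).__name__, type(listed_row).__name__, type(repeated).__name__)
```

Business names can repeat in real inspection data, so it is worth checking the type of the result before relying on it being a Series.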

### Exercise 6
<span  style="color:green; font-size:16px">Select the rows for the restaurants **`WILD GOOSE BAR & GRILL`** and **`TAQUERIA HACIENDA TAPATIA`** along with columns **`Risk`** and **`Results`**.</span>

In [9]:
df.loc[['WILD GOOSE BAR & GRILL', 'TAQUERIA HACIENDA TAPATIA'], ['Risk', 'Results']]

DBA Name,Risk,Results
WILD GOOSE BAR & GRILL,Risk 1 (High),Fail
TAQUERIA HACIENDA TAPATIA,Risk 1 (High),Pass


### Exercise 7
<span  style="color:green; font-size:16px">What is the risk of restaurant **`SCRUB A DUB`**?</span>

In [10]:
df.loc['SCRUB A DUB', 'Risk']

'Risk 2 (Medium)'

### Exercise 8
<span  style="color:green; font-size:16px">Select every 3,000th restaurant from **`THRESHOLD SCHOOL`** to **`SCRUB A DUB`** and the columns from **`Inspection Type`** on to the end of the DataFrame.</span>

In [11]:
df.loc['THRESHOLD SCHOOL':'SCRUB A DUB':3000, 'Inspection Type':]

DBA Name,Inspection Type,Results,Violations
THRESHOLD SCHOOL,Canvass,Pass,38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS...
SAN JOSE FAST FOOD,Canvass,Out of Business,
CHUCHOS ON ADDISON INC,Canvass,Pass w/ Conditions,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
PIZZERIA AROMA,Canvass Re-Inspection,Pass,2. FACILITIES TO MAINTAIN PROPER TEMPERATURE -...
JEWEL FOOD STORE #3210,Canvass,No Entry,
EGANS TAVERN,Canvass,Out of Business,
Los Amigos Grill,Canvass,Out of Business,
JOHNNY VAN'S SMOKEHOUSE,License,Pass,
WEN CAFE INC,Out of Business,Fail,


### Exercise 9
<span  style="color:green; font-size:16px">Select all columns from the 500th restaurant to the 510th</span>

In [12]:
df.iloc[500:510]

DBA Name,Facility Type,Risk,Address,Zip,Inspection Date,Inspection Type,Results,Violations
THE MECCA RESTAURANT,Restaurant,Risk 1 (High),6666 N NORTHWEST HWY,60631.0,03/14/2017,Canvass,Pass w/ Conditions,8. SANITIZING RINSE FOR EQUIPMENT AND UTENSILS...
L & T CHINA FAST WOK INC.,Restaurant,Risk 1 (High),2020 N CALIFORNIA AVE,60647.0,03/14/2017,Canvass,Pass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
SCHNEIDER SENIOR APARTMENT,Golden Diner,Risk 1 (High),1750 W PETERSON AVE,60660.0,03/14/2017,Canvass Re-Inspection,Pass,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR..."
ECKHARDT CAFE,Restaurant,Risk 1 (High),5640 S ELLIS AVE,60637.0,03/14/2017,Canvass,Pass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
WING ZONE,Restaurant,Risk 1 (High),1757 W 87TH ST,60620.0,03/14/2017,Canvass,Out of Business,
ARBYS,Restaurant,Risk 1 (High),500 W MADISON ST,60661.0,03/14/2017,Canvass,Pass,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO..."
LAS ISLAS MARIAS,Restaurant,Risk 1 (High),8205-8209 S PULASKI RD,60652.0,03/14/2017,Canvass,Pass,"30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABEL..."
MARGARET'S,Restaurant,Risk 1 (High),5134 W IRVING PARK RD,60641.0,03/14/2017,Canvass,Pass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...
POPEYES FAMOUS FRIED CHICKEN,Restaurant,Risk 2 (Medium),9516 S VINCENNES AVE,60643.0,03/14/2017,Complaint,Pass w/ Conditions,3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATUR...
PJ PIZZA COMPANY 1 DBA PAPA JOHNS,Restaurant,Risk 2 (Medium),1755 W 87TH ST,60620.0,03/14/2017,Canvass,Out of Business,
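Slices behave differently in **`.loc`** and **`.iloc`**: a label slice includes its stop label, while an integer-location slice excludes its stop position, which is why `df.iloc[500:510]` returns ten rows (positions 500 through 509). A minimal illustration on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with single-letter labels
toy = pd.DataFrame({'value': [10, 20, 30, 40, 50]}, index=list('abcde'))

by_label = toy.loc['b':'d']      # label slice: 'd' is included
by_position = toy.iloc[1:3]      # position slice: position 3 is excluded

print(list(by_label.index))
print(list(by_position.index))
```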


### Exercise 10
<span  style="color:green; font-size:16px">Select restaurants 100, 1,000 and 10,000 along with columns 5, 3, and 1</span>

In [13]:
df.iloc[[100, 1000, 10000], [5, 3, 1]]

DBA Name,Inspection Type,Zip,Risk
TAQUERIA LA CANTERA,License,60608.0,Risk 1 (High)
PARK TAVERN CHICAGO,Complaint,60612.0,Risk 1 (High)
LITTLE VILLAGE HIGH SCHOOL,Canvass,60623.0,Risk 1 (High)


### Exercise 11
<span  style="color:green; font-size:16px">Select the **`Risk`** column and save it to a Series</span>

In [14]:
risk = df['Risk']
risk.head()

DBA Name
DANY'S TACOS                 Risk 1 (High)
BILLY FOOD MARKET INC         Risk 3 (Low)
TAQUERIA HACIENDA TAPATIA    Risk 1 (High)
WILD GOOSE BAR & GRILL       Risk 1 (High)
PUBLICAN TAVERN K1           Risk 1 (High)
Name: Risk, dtype: object

### Exercise 12
<span  style="color:green; font-size:16px">Using the risk Series, select **`ARBYS`** and **`POPEYES FAMOUS FRIED CHICKEN`**</span>

In [15]:
risk.loc[['ARBYS', 'POPEYES FAMOUS FRIED CHICKEN']]

DBA Name
ARBYS                             Risk 1 (High)
POPEYES FAMOUS FRIED CHICKEN    Risk 2 (Medium)
Name: Risk, dtype: object

# Part 2: Boolean Indexing

In [16]:
import pandas as pd
import numpy as np

so = pd.read_csv('../../data/stackoverflow_qa.csv')
so.head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
0,5486226,2011-03-30 12:26:50,4,2113,Rolling median in python,3,4,1.0,yueerhu,125.0,Mike Pennington,26995.0
1,5515021,2011-04-01 14:50:44,8,7015,Compute a compounded return series in Python,3,6,7.0,Jason Strimpel,3301.0,Mike Pennington,26995.0
2,5558607,2011-04-05 21:13:50,2,7392,Sort a pandas DataMatrix in ascending order,2,0,1.0,Jason Strimpel,3301.0,Wes McKinney,43310.0
3,6467832,2011-06-24 12:31:45,9,13056,How to get the correlation between two timeser...,1,0,7.0,user814005,117.0,Wes McKinney,43310.0
4,7577546,2011-09-28 01:58:38,9,2488,"Using pandas, how do I subsample a large DataF...",1,0,5.0,Uri Laserson,958.0,HYRY,54137.0


### Exercise 1
<span  style="color:green; font-size:16px">Find all the questions that have exactly 5 answers.</span>

In [17]:
so[so['answercount'] == 5].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
75,10373660,2012-04-29 16:10:35,199,207980,Converting a Pandas GroupBy object to DataFrame,5,0,90.0,saveenr,2421.0,Wes McKinney,43310.0
115,10791661,2012-05-29 00:06:52,7,12210,How do I discretize values in a pandas DataFra...,5,0,3.0,Uri Laserson,958.0,lbolla,4552.0
130,10972410,2012-06-10 21:12:43,19,46428,pandas: combine two columns in a DataFrame,5,0,5.0,BFTM,895.0,BrenBarn,136870.0
189,11391969,2012-07-09 09:04:33,44,27467,How to group pandas DataFrame entries by date ...,5,1,15.0,Boris Gorelik,9605.0,Wes McKinney,43310.0
246,11811392,2012-08-04 19:25:34,17,20051,How to generate a list from a pandas DataFrame...,5,0,6.0,turtle,1260.0,BrenBarn,136870.0


### Exercise 2
<span  style="color:green; font-size:16px">Find all the questions that have fewer than 10 views</span>

In [18]:
so[so['viewcount'] < 10].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
7787,26493306,2014-10-21 18:06:10,1,6,How to convert hierarchical DataFrame back fro...,0,2,1.0,exp1orer,3551.0,,
17653,33641540,2015-11-10 23:19:36,0,9,Joining Dataframes in Pandas deletes an existi...,1,0,,Rahul Biswas,88.0,,
29414,40062672,2016-10-15 18:24:21,-1,9,Replace one or more sub-strings from multiple ...,1,0,,Andreuccio,186.0,,
36086,42538091,2017-03-01 17:20:37,0,9,Saving box plot pandas python,0,0,,user2906657,119.0,,
36340,42080039,2017-02-07 01:13:30,0,8,pandas: count the non-duplicated elements when...,1,0,,Edamame,2281.0,root,12202.0


### Exercise 3
<span  style="color:green; font-size:16px">Find all the questions where the person asking it is the same as the person answering it</span>

In [19]:
so[so['quest_name'] == so['ans_name']].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
10,8273092,2011-11-25 18:39:02,1,2333,python: pandas install errors,2,0,,codingknob,2279.0,codingknob,2279.0
33,9721429,2012-03-15 14:08:31,9,5732,How do I read a fix width format text file in ...,4,2,,user1234440,3369.0,user1234440,3369.0
58,10020591,2012-04-04 23:17:23,21,20171,How to resample a dataframe with different fun...,4,0,15.0,bmu,17129.0,bmu,17129.0
68,10175068,2012-04-16 13:27:44,15,7501,Select data at a particular level from a Multi...,2,3,5.0,elyase,19551.0,elyase,19551.0
74,10264739,2012-04-22 02:35:06,9,2362,major memory problems reading in a csv file us...,3,2,7.0,vgoklani,1752.0,vgoklani,1752.0


### Exercise 4
<span  style="color:green; font-size:16px">Find all the questions that don't have an accepted answer, but have a score of more than 100</span>

In [20]:
so[so['ans_name'].isnull() & (so['score'] > 100)]

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
184,11350770,2012-07-05 18:57:34,149,155776,pandas + dataframe - select by partial string,7,0,52.0,euforia,816.0,,
514,13148429,2012-10-30 22:22:59,312,228357,How to change the order of DataFrame columns?,19,1,90.0,Timmie,1689.0,,
527,13187778,2012-11-02 00:57:33,102,202329,"Convert pandas dataframe to numpy array, prese...",8,0,45.0,mister.nobody.nz,511.0,,
560,13331698,2012-11-11 13:48:53,154,174964,How to apply a function to two columns of Pand...,9,4,66.0,bigbug,7865.0,,
5013,19377969,2013-10-15 09:42:52,128,144045,Combine two columns of text in dataframe in pa...,10,1,49.0,user2866103,831.0,,
5370,20845213,2013-12-30 18:24:25,130,61332,How to avoid Python/Pandas creating an index i...,3,3,14.0,Alexis,835.0,,
8289,26266362,2014-10-08 21:00:19,132,108655,How to count the Nan values in the column in P...,9,0,39.0,user3799307,661.0,,


### Exercise 5
<span  style="color:green; font-size:16px">Find all the questions where the reputation of the person asking the question is higher than the person answering it. Then find the percentage of times this happens</span>

In [21]:
quest_higher_rep = so[so['quest_rep'] > so['ans_rep']]
quest_higher_rep.head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
26,9588331,2012-03-06 17:01:47,22,10038,Simple cross-tabulation in pandas,2,3,5.0,Jon Clements,85944.0,Jeff Hammerbacher,3172.0
51,9962114,2012-04-01 05:25:20,9,8079,Control Charts in Python,1,0,3.0,John,8807.0,Josh Hemann,590.0
99,10636024,2012-05-17 12:48:00,36,30386,Python / Pandas - GUI for viewing a DataFrame ...,15,6,27.0,Ross R,1562.0,user1319128,555.0
101,10671227,2012-05-20 06:14:57,0,473,Pandas FloatingPoint Error,1,5,,tshauck,5957.0,Karmel,2870.0
112,10760364,2012-05-25 19:33:39,8,2977,"""Zebra Tables"" in IPython Notebook?",3,0,4.0,JD Long,33098.0,minrk,22244.0


In [22]:
# about 4% of the time
len(quest_higher_rep) / len(so)

0.039717720486542075

In [23]:
# advanced - same result: the mean of a boolean Series is the proportion of True values
(so['quest_rep'] > so['ans_rep']).mean()

0.039717720486542075
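Taking the mean of a boolean Series works because `True` counts as 1 and `False` as 0. Comparisons involving a missing value evaluate to `False`, so rows without an accepted answer still count in the denominator. A small sketch with made-up reputations (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical reputations; the third question has no accepted answer
quest_rep = pd.Series([100.0, 50.0, 300.0, 20.0])
ans_rep = pd.Series([80.0, 500.0, np.nan, 40.0])

higher = quest_rep > ans_rep      # the comparison with NaN is False
share = higher.mean()             # same as higher.sum() / len(higher)

print(higher.tolist(), share)
```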

### Exercise 6
<span  style="color:green; font-size:16px">Find all the questions where the number of answers is between 5 and 10 inclusive and the number of views is less than 1,000.</span>

In [24]:
so[so['answercount'].between(5, 10) & (so['viewcount'] < 1000)].head()

Unnamed: 0,id,creationdate,score,viewcount,title,answercount,commentcount,favoritecount,quest_name,quest_rep,ans_name,ans_rep
620,13580875,2012-11-27 09:19:07,2,963,Running average / frequency of time series data?,5,9,1.0,Prof. Falken,15061.0,Victor K,332.0
1316,14742893,2013-02-07 03:13:31,2,598,Interpolation Function,5,0,,eWizardII,886.0,,
4816,21315997,2014-01-23 18:02:57,2,716,Python Pandas vs R. Transformation Code Concis...,5,2,2.0,user2684301,1465.0,Jeff,62248.0
4958,19892107,2013-11-10 16:28:50,5,385,Updating csv with data from a csv with differe...,6,0,1.0,OliverSteph,32.0,,
5339,21040578,2014-01-10 09:33:09,-1,175,combining mutiple csv file,8,2,,user104853,16.0,,
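By default, **`between`** includes both endpoints, so it is a shorthand for two comparisons joined with `&`. A quick check on made-up answer counts (the Series below is hypothetical):

```python
import pandas as pd

# Hypothetical answer counts straddling both endpoints
answercount = pd.Series([4, 5, 7, 10, 11])

with_between = answercount.between(5, 10)
with_compare = (answercount >= 5) & (answercount <= 10)

print(with_between.tolist())
```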


### Exercise 7
<span  style="color:green; font-size:16px">Find the inverse of exercise 6. Verify your results by adding the number of rows in both returned DataFrames and checking that the sum matches the number of rows in the original</span>

In [25]:
ex6 = so[so['answercount'].between(5, 10) & (so['viewcount'] < 1000)]

criteria = (so['answercount'].between(5, 10) & (so['viewcount'] < 1000))
ex7 = so[~criteria]

In [26]:
len(ex6) + len(ex7)

56398

In [27]:
len(so)

56398
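The inverse can also be written without `~` around the whole expression by negating each condition and switching `&` to `|` (De Morgan's law). The sketch below checks this on a small made-up frame; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical questions with answer and view counts
toy = pd.DataFrame({'answercount': [5, 12, 7, 3],
                    'viewcount': [500, 200, 5000, 50]})

a = toy['answercount'].between(5, 10)
b = toy['viewcount'] < 1000

inverse_1 = ~(a & b)     # negate the combined condition
inverse_2 = ~a | ~b      # negate each part and switch & to |

print(inverse_1.tolist())
```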

### Use the employee data for the rest of the exercises

In [28]:
employee = pd.read_csv('../../data/employee.csv')
employee.head()

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,HIRE_DATE,JOB_DATE
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic/Latino,Full Time,Female,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Full Time,Male,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Full Time,Male,1989-06-19,1994-10-22


In [29]:
employee.RACE.value_counts()

Black or African American            700
White                                665
Hispanic/Latino                      480
Asian/Pacific Islander               107
American Indian or Alaskan Native     11
Others                                 2
Name: RACE, dtype: int64

### Exercise 8
<span  style="color:green; font-size:16px">Find all the **`Black or African American`** females that work in the **`Houston Police Department-HPD`**</span>

In [30]:
c1 = employee['RACE'] == 'Black or African American'
c2 = employee['GENDER'] == 'Female'
c3 = employee['DEPARTMENT'] == 'Houston Police Department-HPD'
c_all = c1 & c2 & c3
employee[c_all].head()

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,HIRE_DATE,JOB_DATE
55,ADMINISTRATIVE ASSOCIATE,Houston Police Department-HPD,34757.0,Black or African American,Full Time,Female,2005-05-11,2008-05-17
113,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black or African American,Full Time,Female,1994-06-20,2007-02-24
137,JAIL ATTENDANT,Houston Police Department-HPD,36317.0,Black or African American,Full Time,Female,2008-11-24,2013-07-17
196,SENIOR POLICE OFFICER,Houston Police Department-HPD,66614.0,Black or African American,Full Time,Female,1992-08-31,2008-03-08
228,FINANCIAL ANALYST III,Houston Police Department-HPD,59182.0,Black or African American,Full Time,Female,1985-04-15,2009-07-25


### Exercise 9
<span  style="color:green; font-size:16px">Find the females that have a salary over 100,000 OR males with salary under 50,000</span>

In [31]:
c1 = (employee['GENDER'] == 'Female') & (employee['BASE_SALARY'] > 100000)
c2 = (employee['GENDER'] == 'Male') & (employee['BASE_SALARY'] < 50000)
c_all = c1 | c2
employee[c_all].head()

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,HIRE_DATE,JOB_DATE
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic/Latino,Full Time,Female,2006-06-12,2012-10-13
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Full Time,Male,2015-02-03,2015-02-03
7,CARPENTER,Houston Airport System (HAS),42390.0,White,Full Time,Male,2013-11-04,2013-11-04
9,AIRPORT OPERATIONS COORDINATOR,Houston Airport System (HAS),44616.0,,Full Time,Male,2016-03-14,2016-03-14
12,CUSTOMER SERVICE REPRESENTATIVE I,Public Works & Engineering-PWE,30347.0,Black or African American,Full Time,Male,2015-11-16,2015-11-16
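The parentheses around each comparison are required: in Python, `&` and `|` bind more tightly than `==`, `>` and `<`, so leaving them out changes how the expression is grouped. A minimal version of the same pattern on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical employees
toy = pd.DataFrame({
    'GENDER': ['Female', 'Female', 'Male', 'Male'],
    'BASE_SALARY': [120000.0, 40000.0, 45000.0, 150000.0]})

# Each comparison is wrapped in parentheses before combining with & or |
high_paid_women = (toy['GENDER'] == 'Female') & (toy['BASE_SALARY'] > 100000)
low_paid_men = (toy['GENDER'] == 'Male') & (toy['BASE_SALARY'] < 50000)

selected = toy[high_paid_women | low_paid_men]
print(selected)
```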


### Exercise 10
<span  style="color:green; font-size:16px">Find the females in the following departments with salary over 60,000 (Parks & Recreation, Solid Waste Management, Fleet Management Department, Library)  </span>

In [32]:
deps = ['Parks & Recreation', 'Solid Waste Management', 'Fleet Management Department', 'Library']
c1 = employee['DEPARTMENT'].isin(deps) 
c2 = employee['GENDER'] == 'Female'
c3 = employee['BASE_SALARY'] > 60000

employee[c1 & c2 & c3]

Unnamed: 0,POSITION_TITLE,DEPARTMENT,BASE_SALARY,RACE,EMPLOYMENT_TYPE,GENDER,HIRE_DATE,JOB_DATE
249,STAFF ANALYST,Solid Waste Management,75041.0,Hispanic/Latino,Full Time,Female,1992-10-21,2015-07-18
412,ADMINISTRATIVE COORDINATOR,Library,79302.0,Hispanic/Latino,Full Time,Female,2003-12-22,2005-04-02
476,ADMINISTRATIVE SUPERVISOR,Library,60632.0,Black or African American,Full Time,Female,1993-05-10,2014-11-08
892,LIBRARIAN III,Library,61454.0,White,Full Time,Female,1998-11-02,2002-06-15
1165,DEPUTY ASSISTANT DIRECTOR (EXECUTIVE LEV,Library,107763.0,Black or African American,Full Time,Female,1993-11-16,2014-03-15
1484,SENIOR STAFF ANALYST (EXECUTIVE LEVEL),Parks & Recreation,83916.0,Hispanic/Latino,Full Time,Female,1999-07-26,2013-08-31
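**`isin`** stands in for a chain of equality tests joined with `|`. A short comparison on a made-up Series (the department values below are illustrative):

```python
import pandas as pd

# Hypothetical department column
dept = pd.Series(['Library', 'Police', 'Parks & Recreation', 'Library'])
deps = ['Parks & Recreation', 'Library']

with_isin = dept.isin(deps)
with_or = (dept == 'Parks & Recreation') | (dept == 'Library')

print(with_isin.tolist())
```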


### Exercise 11
<span  style="color:green; font-size:16px">Find all the males with salary over 100,000. Return only the race, gender and salary columns</span>

In [33]:
criteria = (employee['GENDER'] == 'Male') & (employee['BASE_SALARY'] > 100000)
employee.loc[criteria, ['RACE', 'GENDER', 'BASE_SALARY']].head()

Unnamed: 0,RACE,GENDER,BASE_SALARY
8,White,Male,107962.0
11,Black or African American,Male,180416.0
43,Hispanic/Latino,Male,165216.0
169,White,Male,120916.0
178,White,Male,210588.0


### Exercise 12
<span  style="color:green; font-size:16px">Select all salaries as a Series in a separate variable. From this series select all salaries under 25,000</span>

In [34]:
s = employee['BASE_SALARY']
s[s < 25000]

454    24960.0
Name: BASE_SALARY, dtype: float64

### Exercise 13
<span  style="color:green; font-size:16px">Get the same exact result as exercise 12, but make your selection from the employee DataFrame. Use only a single line of code</span>

In [35]:
employee.loc[employee['BASE_SALARY'] < 25000, 'BASE_SALARY']

454    24960.0
Name: BASE_SALARY, dtype: float64

# Part 3: Assigning subsets of data

In [36]:
import pandas as pd
import numpy as np

In [37]:
df = pd.read_csv('../../data/employee_sample.csv', index_col=0)
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY
Tom,Male,White,Engineering,23,107962
Niko,Male,Black,Engineering,1,30347
Penelope,Female,White,Engineering,12,60258
Aria,Female,Black,Engineering,8,43618
Sofia,Female,Black,Parks & Recreation,23,26125
Dean,Male,Black,Parks & Recreation,3,33592
Zach,Male,White,Parks & Recreation,4,37565


In [38]:
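def style_diff(df, df_orig):
    """Highlight the cells of df whose values differ from df_orig."""
    # CSS for each comparison result: unchanged cells get no style
    style = {True: '', 
             False: 'color: red; background-color: yellow'}
    df_style = (df == df_orig).replace(style)
    # axis=None passes the whole DataFrame to the styling function at once
    return df.style.apply(lambda x: df_style, axis=None)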

### Exercise 1
<span  style="color:green; font-size:16px"> Create a Series with the index equal to the same values in **`df`** but in a different order. Make the values of the Series integers representing the employee's age ranging somewhere from 20 to 65. Create a new column **`AGE`**. Does the order of the index matter?</span>

In [39]:
s = pd.Series(data=[25, 30, 40, 50, 35, 36, 44],
              index=['Aria', 'Niko', 'Tom', 'Zach', 'Penelope', 'Sofia', 'Dean'])
s

Aria        25
Niko        30
Tom         40
Zach        50
Penelope    35
Sofia       36
Dean        44
dtype: int64

In [40]:
df['AGE'] = s
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE
Tom,Male,White,Engineering,23,107962,40
Niko,Male,Black,Engineering,1,30347,30
Penelope,Female,White,Engineering,12,60258,35
Aria,Female,Black,Engineering,8,43618,25
Sofia,Female,Black,Parks & Recreation,23,26125,36
Dean,Male,Black,Parks & Recreation,3,33592,44
Zach,Male,White,Parks & Recreation,4,37565,50


The order of the index does not matter. When a Series is assigned as a new column, pandas aligns it with the DataFrame on index labels, not on position.
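A consequence of aligning on the index is that any row label missing from the Series ends up with a missing value. The sketch below uses a made-up three-row frame (hypothetical data) in which `Niko` has no age:

```python
import pandas as pd

# Hypothetical frame; the Series is in a different order and lacks 'Niko'
toy = pd.DataFrame({'SALARY': [100, 200, 300]},
                   index=['Tom', 'Niko', 'Aria'])
ages = pd.Series([25, 40], index=['Aria', 'Tom'])

toy['AGE'] = ages    # values are matched by label, not by position
print(toy)
```

Because of the missing value, the new column is stored as floats in this example.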

### Exercise 2
<span  style="color:green; font-size:16px"> Create a new column **`BONUS`** equal to 0 for everyone</span>

In [41]:
df['BONUS'] = 0
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE,BONUS
Tom,Male,White,Engineering,23,107962,40,0
Niko,Male,Black,Engineering,1,30347,30,0
Penelope,Female,White,Engineering,12,60258,35,0
Aria,Female,Black,Engineering,8,43618,25,0
Sofia,Female,Black,Parks & Recreation,23,26125,36,0
Dean,Male,Black,Parks & Recreation,3,33592,44,0
Zach,Male,White,Parks & Recreation,4,37565,50,0


### Exercise 3
<span  style="color:green; font-size:16px"> Change the **`BONUS`** column so that everyone with more than 10 years of experience gets $10,000. Use the **`style_diff`** function to display the results.</span>

In [42]:
df_orig = df.copy()
df.loc[df['YEARS EXPERIENCE'] > 10, 'BONUS'] = 10000
style_diff(df, df_orig)

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE,BONUS
Tom,Male,White,Engineering,23,107962,40,10000
Niko,Male,Black,Engineering,1,30347,30,0
Penelope,Female,White,Engineering,12,60258,35,10000
Aria,Female,Black,Engineering,8,43618,25,0
Sofia,Female,Black,Parks & Recreation,23,26125,36,10000
Dean,Male,Black,Parks & Recreation,3,33592,44,0
Zach,Male,White,Parks & Recreation,4,37565,50,0


### Exercise 4
<span  style="color:green; font-size:16px"> Create a new column **`TOTAL SALARY`** that is 10% higher than the current **`SALARY`** column and add the bonus on top of that as well. Use **`.loc`** on the left-hand side and NOT *just the indexing operator*. Make the data type of **`TOTAL SALARY`** an integer.</span>

In [43]:
df.loc[:, 'TOTAL SALARY'] = df['SALARY'] * 1.1 + df['BONUS']
df.loc[:, 'TOTAL SALARY'] = df.loc[:, 'TOTAL SALARY'].astype(int)
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE,BONUS,TOTAL SALARY
Tom,Male,White,Engineering,23,107962,40,10000,128758
Niko,Male,Black,Engineering,1,30347,30,0,33381
Penelope,Female,White,Engineering,12,60258,35,10000,76283
Aria,Female,Black,Engineering,8,43618,25,0,47979
Sofia,Female,Black,Parks & Recreation,23,26125,36,10000,38737
Dean,Male,Black,Parks & Recreation,3,33592,44,0,36951
Zach,Male,White,Parks & Recreation,4,37565,50,0,41321


In [44]:
# in one line
df.loc[:, 'TOTAL SALARY'] = (df['SALARY'] * 1.1 + df['BONUS']).astype(int)
df

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE,BONUS,TOTAL SALARY
Tom,Male,White,Engineering,23,107962,40,10000,128758
Niko,Male,Black,Engineering,1,30347,30,0,33381
Penelope,Female,White,Engineering,12,60258,35,10000,76283
Aria,Female,Black,Engineering,8,43618,25,0,47979
Sofia,Female,Black,Parks & Recreation,23,26125,36,10000,38737
Dean,Male,Black,Parks & Recreation,3,33592,44,0,36951
Zach,Male,White,Parks & Recreation,4,37565,50,0,41321


### Exercise 5
<span  style="color:green; font-size:16px"> Set Aria's department to 'Police'. Highlight the change</span>

In [45]:
df_orig = df.copy()
df.loc['Aria', 'DEPARTMENT'] = 'Police'
style_diff(df, df_orig)

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE,BONUS,TOTAL SALARY
Tom,Male,White,Engineering,23,107962,40,10000,128758
Niko,Male,Black,Engineering,1,30347,30,0,33381
Penelope,Female,White,Engineering,12,60258,35,10000,76283
Aria,Female,Black,Police,8,43618,25,0,47979
Sofia,Female,Black,Parks & Recreation,23,26125,36,10000,38737
Dean,Male,Black,Parks & Recreation,3,33592,44,0,36951
Zach,Male,White,Parks & Recreation,4,37565,50,0,41321


### Exercise 6
<span  style="color:green; font-size:16px"> Give all the white engineers a salary raise of \$10,000 and all the black Parks & Recreation employees a decrease in salary by \$10,000. Highlight the change.</span>

In [46]:
df_orig = df.copy()
white_eng = (df['RACE'] == 'White') & (df['DEPARTMENT'] == 'Engineering')
black_pr = (df['RACE'] == 'Black') & (df['DEPARTMENT'] == 'Parks & Recreation')
df.loc[white_eng, 'SALARY'] += 10000
df.loc[black_pr, 'SALARY'] -= 10000
style_diff(df, df_orig)

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE,BONUS,TOTAL SALARY
Tom,Male,White,Engineering,23,117962,40,10000,128758
Niko,Male,Black,Engineering,1,30347,30,0,33381
Penelope,Female,White,Engineering,12,70258,35,10000,76283
Aria,Female,Black,Police,8,43618,25,0,47979
Sofia,Female,Black,Parks & Recreation,23,16125,36,10000,38737
Dean,Male,Black,Parks & Recreation,3,23592,44,0,36951
Zach,Male,White,Parks & Recreation,4,37565,50,0,41321


### Exercise 7
<span  style="color:green; font-size:16px"> Use **`.iloc`** to change the age of the employees with integer location 3 and 5 to 60 and 65 respectively. Highlight the change.</span>

In [47]:
df_orig = df.copy()
df.iloc[[3, 5], -3] = [60, 65]
style_diff(df, df_orig)

Unnamed: 0,GENDER,RACE,DEPARTMENT,YEARS EXPERIENCE,SALARY,AGE,BONUS,TOTAL SALARY
Tom,Male,White,Engineering,23,117962,40,10000,128758
Niko,Male,Black,Engineering,1,30347,30,0,33381
Penelope,Female,White,Engineering,12,70258,35,10000,76283
Aria,Female,Black,Police,8,43618,60,0,47979
Sofia,Female,Black,Parks & Recreation,23,16125,36,10000,38737
Dean,Male,Black,Parks & Recreation,3,23592,65,0,36951
Zach,Male,White,Parks & Recreation,4,37565,50,0,41321


# Part 4: How NOT to select subsets of data

In [48]:
import pandas as pd
import numpy as np

### Exercise 1
<span  style="color:green; font-size:16px"> Create a ten item list and then use chained indexing to select the items from 2 to the end and then from this subset, the first three items</span>

In [49]:
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
a[2:][:3]

[3, 4, 5]

### Exercise 2
<span  style="color:green; font-size:16px"> Get the same result as exercise 1 with just a single call to the indexing operator.</span>

In [50]:
a[2:5]

[3, 4, 5]

Use the following DataFrame for the next several questions

In [51]:
df = pd.read_csv('../../data/sample_data.csv', index_col=0)
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,8.3
Aaron,FL,red,Mango,12,120,9.0
Penelope,AL,white,Apple,4,80,3.3
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Exercise 3
<span  style="color:green; font-size:16px">Determine whether the following line of code is chained indexing. If it is, rewrite it so that it is not.</span>

In [52]:
df[['state', 'food', 'color', 'score']]

Unnamed: 0,state,food,color,score
Jane,NY,Steak,blue,4.6
Niko,TX,Lamb,green,8.3
Aaron,FL,Mango,red,9.0
Penelope,AL,Apple,white,3.3
Dean,AK,Cheese,gray,1.8
Christina,TX,Melon,black,9.5
Cornelia,TX,Beans,red,2.2


No, this is not chained indexing. Passing a list of strings to a single indexing operator is the correct way to select several columns.

### Exercise 4
<span  style="color:green; font-size:16px">Determine whether the following line of code is chained indexing. If it is, rewrite it so that it is not.</span>

In [53]:
df.loc[['Niko', 'Penelope', 'Dean', 'Cornelia', 'Jane'], ['state', 'food', 'color', 'score']]

Unnamed: 0,state,food,color,score
Niko,TX,Lamb,green,8.3
Penelope,AL,Apple,white,3.3
Dean,AK,Cheese,gray,1.8
Cornelia,TX,Beans,red,2.2
Jane,NY,Steak,blue,4.6


No, this is the correct way to choose rows and columns. There is exactly one call to **`.loc`**. The row selection is simultaneously done with the column selection.

### Exercise 5
<span  style="color:green; font-size:16px">Determine whether the following line of code is chained indexing. If it is, rewrite it so that it is not.</span>

In [54]:
df.loc[['Niko', 'Penelope', 'Dean', 'Cornelia', 'Jane']][['state', 'food', 'color', 'score']]

Unnamed: 0,state,food,color,score
Niko,TX,Lamb,green,8.3
Penelope,AL,Apple,white,3.3
Dean,AK,Cheese,gray,1.8
Cornelia,TX,Beans,red,2.2
Jane,NY,Steak,blue,4.6


This is chained indexing. There is one call to **`.loc`** which selects the rows and another call to *just the indexing operator* which selects the columns. The idiomatic solution is the code from exercise 4.

### Exercise 6
<span  style="color:green; font-size:16px">Determine whether the following line of code is chained indexing. If it is, rewrite it so that it is not.</span>

In [55]:
df.iloc[:5].iloc[2:, 1:4]

Unnamed: 0,color,food,age
Aaron,red,Mango,12
Penelope,white,Apple,4
Dean,gray,Cheese,32


This is chained indexing. There are two separate calls to **`.iloc`**. Fix it like this:

In [56]:
df.iloc[2:5, 1:4]

Unnamed: 0,color,food,age
Aaron,red,Mango,12
Penelope,white,Apple,4
Dean,gray,Cheese,32


### Exercise 7
<span  style="color:green; font-size:16px">Determine whether the following line of code is chained indexing. If it is, rewrite it so that it is not.</span>

In [57]:
df[df['state'] == 'TX']['age']

Niko          2
Christina    33
Cornelia     69
Name: age, dtype: int64

This is chained indexing. Use **`.loc`** to apply the boolean selection and the column selection in a single call:

In [58]:
df.loc[df['state'] == 'TX', 'age']

Niko          2
Christina    33
Cornelia     69
Name: age, dtype: int64
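
The single-call form matters most when assigning. With chained indexing, the assignment may land on a temporary copy and leave the original DataFrame unchanged. The sketch below uses a small hypothetical DataFrame named `people` to show an assignment through one **`.loc`** call reaching the original data.

```python
import pandas as pd

# `people` is a small hypothetical DataFrame used only for this illustration.
people = pd.DataFrame(
    {"state": ["NY", "TX", "FL", "TX"], "age": [30, 2, 12, 33]},
    index=["Jane", "Niko", "Aaron", "Christina"],
)

# Build the boolean mask once, then select rows and column in one .loc call.
is_texan = people["state"] == "TX"
people.loc[is_texan, "age"] = 0

# The assignment was made directly on `people`, not on an intermediate object.
ages_after = people["age"].tolist()
```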

### Exercise 8
<span  style="color:green; font-size:16px">Determine whether the following line of code is chained indexing. If it is, rewrite it so that it is not.</span>

In [59]:
df[['state', 'food', 'age', 'height']][['state', 'age']].loc[['Jane', 'Aaron'], ['age', 'state']]

Unnamed: 0,age,state
Jane,30,NY
Aaron,12,FL


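This chains three indexing operations: two uses of the indexing operator followed by a call to **`.loc`**. A single call to **`.loc`** does all of the work: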

In [60]:
df.loc[['Jane', 'Aaron'], ['age', 'state']]

Unnamed: 0,age,state
Jane,30,NY
Aaron,12,FL


### Exercise 9
<span  style="color:green; font-size:16px">Change the score for all people who are under 30 years of age to 99 without using chained indexing.</span>

In [61]:
df.loc[df['age'] < 30, 'score'] = 99
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,99.0
Aaron,FL,red,Mango,12,120,99.0
Penelope,AL,white,Apple,4,80,99.0
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2


### Exercise 10
<span  style="color:green; font-size:16px">Select the **`color`** and **`food`** columns into their own variable. Then write one more line of code that will trigger the **`SettingWithCopyWarning`**.</span>

In [63]:
df1 = df[['color', 'food']]
df1

Unnamed: 0,color,food
Jane,blue,Steak
Niko,green,Lamb
Aaron,red,Mango
Penelope,white,Apple
Dean,gray,Cheese
Christina,black,Melon
Cornelia,red,Beans


In [65]:
df1.loc[df['color'] == 'red', 'food'] = ['Watermelon', 'Gyros']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [66]:
df1

Unnamed: 0,color,food
Jane,blue,Steak
Niko,green,Lamb
Aaron,red,Watermelon
Penelope,white,Apple
Dean,gray,Cheese
Christina,black,Melon
Cornelia,red,Gyros


### Exercise 11
<span  style="color:green; font-size:16px">Select the **`color`** and **`food`** columns into their own variable as you did in Exercise 10. Then perform the same assignment without triggering the warning.</span>

In [68]:
df1 = df[['color', 'food']].copy()
df1.loc[df['color'] == 'red', 'food'] = ['Watermelon', 'Gyros']
df1

Unnamed: 0,color,food
Jane,blue,Steak
Niko,green,Lamb
Aaron,red,Watermelon
Penelope,white,Apple
Dean,gray,Cheese
Christina,black,Melon
Cornelia,red,Gyros


In [70]:
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,green,Lamb,2,70,99.0
Aaron,FL,red,Mango,12,120,99.0
Penelope,AL,white,Apple,4,80,99.0
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
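
Displaying **`df`** shows that the original food values are untouched: **`.copy`** gives the subset its own data, so assignments to it never reach the source. The sketch below checks this programmatically on a small hypothetical DataFrame; the names `df_demo` and `subset` are illustrative and not part of the exercises.

```python
import pandas as pd

# Hypothetical three-row sample; names and values are for illustration only.
df_demo = pd.DataFrame(
    {
        "color": ["blue", "red", "red"],
        "food": ["Steak", "Mango", "Beans"],
        "age": [30, 12, 69],
    },
    index=["Jane", "Aaron", "Cornelia"],
)

# An explicit copy makes `subset` independent of `df_demo`.
subset = df_demo[["color", "food"]].copy()

# Two rows match the mask, so a list of two values can be assigned.
subset.loc[subset["color"] == "red", "food"] = ["Watermelon", "Gyros"]

original_food = df_demo["food"].tolist()  # unchanged
subset_food = subset["food"].tolist()     # updated
```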


### Exercise 12
<span  style="color:green; font-size:16px">Change the values of **`color`**, **`age`**, and **`score`** for **`Niko`** to anything you like</span>

In [71]:
df.loc['Niko', ['color', 'age', 'score']] = ['RED', 999, -1]
df

Unnamed: 0,state,color,food,age,height,score
Jane,NY,blue,Steak,30,165,4.6
Niko,TX,RED,Lamb,999,70,-1.0
Aaron,FL,red,Mango,12,120,99.0
Penelope,AL,white,Apple,4,80,99.0
Dean,AK,gray,Cheese,32,180,1.8
Christina,TX,black,Melon,33,172,9.5
Cornelia,TX,red,Beans,69,150,2.2
