### 1. Import and load

In [2]:
import pandas as pd
import numpy as np

nls97 = pd.read_csv('data/nls97.csv')
nls97.set_index('personid', inplace=True)

### 2. Use slicing to start at the 1001st row and go to the 1004th row:

`nls97[1000:1004]` selects every row starting from the row indicated by the integer to the left of the colon (`1000`, in this case) up to, but not including, the row indicated by the integer to the right of the colon (`1004`). The row at position `1000` is actually the 1001st row because of zero-based indexing. Each row appears as a column in the output since we have transposed the resulting `DataFrame`:

In [3]:
nls97[1000:1004].T

personid,195884,195891,195970,195996
gender,Male,Male,Female,Female
birthmonth,12,9,3,9
birthyear,1981,1980,1982,1980
highestgradecompleted,,12.0,17.0,
maritalstatus,,Never-married,Never-married,
...,...,...,...,...
colenroct15,,1. Not enrolled,1. Not enrolled,
colenrfeb16,,1. Not enrolled,1. Not enrolled,
colenroct16,,1. Not enrolled,1. Not enrolled,
colenrfeb17,,1. Not enrolled,1. Not enrolled,
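
To see the half-open behavior of bracket slicing in isolation, here is a minimal sketch on a made-up six-row DataFrame. The `toy` frame, its `personid` values, and its columns are assumptions for illustration only and are not taken from the NLS data:

```python
import pandas as pd

# Hypothetical stand-in for nls97; all IDs and values are invented.
toy = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
        "birthyear": [1980, 1983, 1984, 1982, 1981, 1980],
    },
    index=pd.Index([101, 102, 103, 104, 105, 106], name="personid"),
)

# Bracket slicing works on row positions, not on personid labels.
# Positions 2 and 3 are returned; position 4 is excluded.
subset = toy[2:4]
print(subset.T)
```

The slice returns the third and fourth rows (labels 103 and 104), mirroring how `1000:1004` returns four rows beginning with the 1001st.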


### 3. Select a few rows using the `loc` data accessor.

Use the `loc` accessor to select by `index` label. We can pass a list of index labels or we can specify a range of labels. 

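Note that `nls97.loc[[195884,195891,195970]]` and `nls97.loc[195884:195970]` return the same `DataFrame` here, because these three labels are adjacent in the index and a `loc` label slice includes both endpoints: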

In [4]:
nls97.loc[[195884,195891,195970]].T

personid,195884,195891,195970
gender,Male,Male,Female
birthmonth,12,9,3
birthyear,1981,1980,1982
highestgradecompleted,,12.0,17.0
maritalstatus,,Never-married,Never-married
...,...,...,...
colenroct15,,1. Not enrolled,1. Not enrolled
colenrfeb16,,1. Not enrolled,1. Not enrolled
colenroct16,,1. Not enrolled,1. Not enrolled
colenrfeb17,,1. Not enrolled,1. Not enrolled


In [5]:
nls97.loc[195884:195970].T

personid,195884,195891,195970
gender,Male,Male,Female
birthmonth,12,9,3
birthyear,1981,1980,1982
highestgradecompleted,,12.0,17.0
maritalstatus,,Never-married,Never-married
...,...,...,...
colenroct15,,1. Not enrolled,1. Not enrolled
colenrfeb16,,1. Not enrolled,1. Not enrolled
colenroct16,,1. Not enrolled,1. Not enrolled
colenrfeb17,,1. Not enrolled,1. Not enrolled
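
The equivalence between a label list and a label slice can be checked directly. The sketch below uses an invented `toy` DataFrame (an assumption, not the NLS data) and relies on the fact that, unlike positional slicing, a `loc` slice includes its right endpoint:

```python
import pandas as pd

# Hypothetical stand-in for nls97; all IDs and values are invented.
toy = pd.DataFrame(
    {"gender": ["Female", "Male", "Male", "Female", "Male"]},
    index=pd.Index([101, 102, 103, 104, 105], name="personid"),
)

by_list = toy.loc[[102, 103, 104]]   # explicit list of labels
by_slice = toy.loc[102:104]          # label slice, both ends included

print(by_list.equals(by_slice))
```

The two results match only because the listed labels are consecutive in the index; a label slice always returns every row between its endpoints.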


### 4. Select a row from the beginning of the DataFrame with the `iloc` data accessor.

`iloc` differs from `loc` in that it takes integer row positions rather than index labels, so it works much like bracket-operator slicing. In this step, we pass a one-item list containing `0`, which returns a DataFrame holding the first row:

In [7]:
nls97.iloc[[0]].T

personid,100061
gender,Female
birthmonth,5
birthyear,1980
highestgradecompleted,13.0
maritalstatus,Married
...,...
colenroct15,1. Not enrolled
colenrfeb16,1. Not enrolled
colenroct16,1. Not enrolled
colenrfeb17,1. Not enrolled
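
The one-item list matters: a bare integer returns a Series rather than a DataFrame. A minimal sketch of the difference, using an invented `toy` frame (an assumption for illustration, not the NLS data):

```python
import pandas as pd

# Hypothetical stand-in for nls97; all IDs and values are invented.
toy = pd.DataFrame(
    {"gender": ["Female", "Male", "Male"], "birthyear": [1980, 1983, 1984]},
    index=pd.Index([101, 102, 103], name="personid"),
)

row_as_frame = toy.iloc[[0]]   # list of positions: one-row DataFrame
row_as_series = toy.iloc[0]    # single position: Series named after the label

print(type(row_as_frame).__name__, row_as_frame.shape)
print(type(row_as_series).__name__, row_as_series.shape)
```

Keeping the result as a DataFrame is what allows `.T` to display the row as a single labeled column.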


### 5. Select a few rows from the beginning of the DataFrame with the `iloc` data accessor.

We pass a three-item list, `[0,1,2]`, to return a DataFrame of the first three rows of `nls97`. We would get the same result if we passed the slice `0:3` to the accessor:

In [8]:
nls97.iloc[[0,1,2]].T

personid,100061,100139,100284
gender,Female,Male,Male
birthmonth,5,9,11
birthyear,1980,1983,1984
highestgradecompleted,13.0,12.0,7.0
maritalstatus,Married,Married,Never-married
...,...,...,...
colenroct15,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenroct16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb17,1. Not enrolled,1. Not enrolled,1. Not enrolled


In [10]:
nls97.iloc[0:3].T  # a slice needs only one pair of brackets

personid,100061,100139,100284
gender,Female,Male,Male
birthmonth,5,9,11
birthyear,1980,1983,1984
highestgradecompleted,13.0,12.0,7.0
maritalstatus,Married,Married,Never-married
...,...,...,...
colenroct15,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenroct16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb17,1. Not enrolled,1. Not enrolled,1. Not enrolled


### 5. Select a few rows from the end of the DataFrame with the `iloc` data accessor.

Use `nls97.iloc[[-3,-2,-1]]` and `nls97.iloc[-3:]` to retrieve the last three rows of the DataFrame.

By not providing a value to the right of the colon in `[-3:]`, we are telling the accessor to get all rows from the third-to-last row to the end of the DataFrame:

In [11]:
nls97.iloc[[-3, -2, -1]].T

personid,999543,999698,999963
gender,Female,Female,Female
birthmonth,8,5,9
birthyear,1984,1983,1982
highestgradecompleted,12.0,12.0,17.0
maritalstatus,Divorced,Never-married,Married
...,...,...,...
colenroct15,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenroct16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb17,1. Not enrolled,1. Not enrolled,1. Not enrolled


In [12]:
nls97.iloc[-3:].T

personid,999543,999698,999963
gender,Female,Female,Female
birthmonth,8,5,9
birthyear,1984,1983,1982
highestgradecompleted,12.0,12.0,17.0
maritalstatus,Divorced,Never-married,Married
...,...,...,...
colenroct15,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenroct16,1. Not enrolled,1. Not enrolled,1. Not enrolled
colenrfeb17,1. Not enrolled,1. Not enrolled,1. Not enrolled
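
The list and slice forms of `iloc` can be compared at both ends of a DataFrame. This is a small sketch on an invented `toy` frame (an assumption, not the NLS data):

```python
import pandas as pd

# Hypothetical stand-in for nls97; all IDs and values are invented.
toy = pd.DataFrame(
    {"birthyear": [1980, 1983, 1984, 1982, 1981]},
    index=pd.Index([101, 102, 103, 104, 105], name="personid"),
)

# First three rows: positions 0, 1, 2 (the slice stops before 3).
first_list = toy.iloc[[0, 1, 2]]
first_slice = toy.iloc[0:3]

# Last three rows: negative positions count back from the end.
last_list = toy.iloc[[-3, -2, -1]]
last_slice = toy.iloc[-3:]

print(first_list.equals(first_slice), last_list.equals(last_slice))
```

The slice form is usually shorter, while the list form also allows non-adjacent positions such as `[0, 2, 4]`.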


### 6. Select multiple rows conditionally using Boolean indexing.

Create a DataFrame of just the individuals who get very little sleep. Of the 6,706 individuals who responded to that question, about 5% got 4 or fewer hours of sleep per night.

Test who is getting 4 or fewer hours of sleep with `nls97.nightlyhrssleep <= 4`, which generates a pandas Series of `True` and `False` values that we assign to `sleep_check_bool`.

Pass that Series to the `loc` accessor to create a `low_sleep` DataFrame.

`low_sleep` has approximately the number of rows we expect. We do not need the extra step of assigning the Boolean Series to a variable; it is done here only for explanatory purposes:

In [13]:
nls97.nightlyhrssleep.quantile(0.05) # 5%

4.0

In [15]:
nls97.nightlyhrssleep.count()

6706

In [17]:
sleep_check_bool = nls97.nightlyhrssleep <= 4
sleep_check_bool

personid
100061    False
100139    False
100284    False
100292    False
100583    False
          ...  
999291    False
999406    False
999543    False
999698    False
999963    False
Name: nightlyhrssleep, Length: 8984, dtype: bool

In [19]:
type(sleep_check_bool)

pandas.core.series.Series

In [18]:
low_sleep = nls97.loc[sleep_check_bool]
low_sleep.shape

(364, 88)
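
The Boolean Series has one entry per row of the full DataFrame (8984 above), even though only 6706 people answered the sleep question. That is because a comparison against a missing value evaluates to `False`. The sketch below illustrates this on an invented `toy` frame (an assumption, not the NLS data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nls97; all IDs and values are invented.
toy = pd.DataFrame(
    {"nightlyhrssleep": [7.0, 4.0, np.nan, 3.0, 8.0]},
    index=pd.Index([101, 102, 103, 104, 105], name="personid"),
)

# One True/False per row; the NaN row compares as False.
mask = toy.nightlyhrssleep <= 4

# loc keeps only the rows where the mask is True.
low = toy.loc[mask]
print(mask.sum(), low.shape)
```

Non-respondents are therefore dropped from the filtered result without any explicit handling of missing values.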

### 7. Select rows based on multiple conditions.

It may be that folks who are not getting much sleep also have a fair number of children living with them.

Use `describe` to get a sense of the distribution of the number of children for those in `low_sleep`.

About a quarter have three or more children.

Create a new DataFrame with individuals who have `nightlyhrssleep` of 4 or less and 3 or more children at home.

`&` is the logical AND operator in pandas; both conditions must be true for a row to be selected. (We would have gotten the same result by working from the `low_sleep` DataFrame, with `low_sleep_3_plus_children = low_sleep[low_sleep.childathome >= 3]`, but then we would not have been able to demonstrate testing multiple conditions):

In [20]:
low_sleep.childathome.describe()

count    293.000000
mean       1.788396
std        1.400685
min        0.000000
25%        1.000000
50%        2.000000
75%        3.000000
max        9.000000
Name: childathome, dtype: float64

In [24]:
low_sleep_3_plus_children = nls97.loc[(nls97.nightlyhrssleep <= 4) & (nls97.childathome >= 3)]
low_sleep_3_plus_children.shape

(82, 88)
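
The claim that the combined condition and the two-step filter agree can be checked on a small example. The `toy` frame below is invented for illustration (an assumption, not the NLS data). Note that each condition is wrapped in parentheses, because `&` binds more tightly than comparison operators such as `<=`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for nls97; all IDs and values are invented.
toy = pd.DataFrame(
    {
        "nightlyhrssleep": [7.0, 4.0, np.nan, 3.0, 4.0, 2.0],
        "childathome": [3.0, 4.0, 5.0, 1.0, 3.0, np.nan],
    },
    index=pd.Index([101, 102, 103, 104, 105, 106], name="personid"),
)

# Both conditions in one step.
both = toy.loc[(toy.nightlyhrssleep <= 4) & (toy.childathome >= 3)]

# The same selection in two steps.
low = toy.loc[toy.nightlyhrssleep <= 4]
two_step = low[low.childathome >= 3]

print(both.equals(two_step))
```

Rows with a missing value in either column fail the corresponding comparison and are excluded by both approaches.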

### 8. Select rows and columns based on multiple conditions.

Pass the condition to the `loc` accessor to select rows, followed by a list of column names to select columns:

In [25]:
low_sleep_3_plus_children = nls97.loc[(nls97.nightlyhrssleep <= 4) & (nls97.childathome >= 3), ['nightlyhrssleep', 'childathome']]
low_sleep_3_plus_children

personid,nightlyhrssleep,childathome
119754,4.0,4.0
141531,4.0,5.0
152706,4.0,4.0
156823,1.0,3.0
158355,4.0,4.0
...,...,...
905774,4.0,3.0
907315,4.0,3.0
955166,3.0,3.0
956100,4.0,6.0
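
As a minimal sketch of the row-and-column form of `loc`, the example below uses an invented `toy` frame (an assumption, not the NLS data). It also shows that passing a single column name instead of a list returns a Series:

```python
import pandas as pd

# Hypothetical stand-in for nls97; all IDs and values are invented.
toy = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Male", "Female"],
        "nightlyhrssleep": [7.0, 4.0, 3.0, 4.0],
        "childathome": [3.0, 4.0, 1.0, 3.0],
    },
    index=pd.Index([101, 102, 103, 104], name="personid"),
)

rows = (toy.nightlyhrssleep <= 4) & (toy.childathome >= 3)

# Row condition first, column list second.
subset = toy.loc[rows, ["nightlyhrssleep", "childathome"]]

# A bare column name (no list) yields a Series instead of a DataFrame.
one_col = toy.loc[rows, "childathome"]

print(subset)
```

Selecting rows and columns in a single `loc` call avoids chaining two indexing operations.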
