# Applying Filters

First, run the cell below to import the packages that we will use.

In [2]:
import pandas as pd
import numpy as np
import os 

## Step 1: Inspect the Data


We will be working with the "adult" data set which contains Census information from 1994. 

The code cell below loads the data set by using the Pandas `pd.read_csv()` function to read in the CSV file that contains the data. The file is located in a folder named "data" and has the name "adult.data.partial". The `pd.read_csv()` function returns a Pandas DataFrame, which we will assign to a variable called `df`. Run the cell below to load the data.

In [36]:
filename = os.path.join("/Users/salmanyagaka/Documents/interviews/2 Managing Data in ML/Module 2: Create labels and features/adult.data")
df = pd.read_csv(filename, header=0)
df

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32555,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32556,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32557,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32558,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In the code cell below, use the Pandas DataFrame `head()` method to display the first few rows of the DataFrame `df`.

In [37]:
df.head(10)

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
5,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
6,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
7,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
8,42,Private,159449,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,5178,0,40,United-States,>50K
9,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K


Use the Pandas `shape` property to display the number of rows and columns in the data. If you forgot the syntax, call `df.shape?` in the cell below to read the documentation.

How many examples (rows) do we have? How many features (columns)?


In [38]:
columns = df.shape[1]
rows = df.shape[0]
print(columns, rows)

15 32560
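Because `shape` is a plain tuple of `(rows, columns)`, it can also be unpacked in one step. The following is a minimal sketch on a small made-up DataFrame; the name `toy` and its values are illustrative assumptions, not part of the data set.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the shape tuple.
toy = pd.DataFrame({"age": [39, 50, 38], "hours": [40, 13, 40]})

# shape is (number of rows, number of columns)
n_rows, n_cols = toy.shape
print(n_rows, n_cols)  # prints: 3 2
```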


## Step 2: Random Sampling of the Data
Random sampling from the data using `np.random.choice` and `loc`

We will start by sampling some of the data. You will learn more about sampling in a future exercise. 

For now, imagine that you need to randomly select 30% of the data examples.

First, we will do this the NumPy way.

In the cell below, some code is already written to randomly select 30% of the rows and save their indices to the variable `indices`.

The variable `indices` contains only the indices of the rows, not the actual data in the rows.

Complete the code below to obtain only the rows in `df` with the indices specified in the variable `indices`.

Recall that you can use `loc[]` to index into a DataFrame and access rows by label. Use `loc[]` to accomplish this task.

Save this result to a new DataFrame named `df_subset`. 

### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [43]:
percentage = 0.3
num_rows = df.shape[0] 
print(int(percentage*num_rows))

# replace=False samples without replacement, so no index is selected more than once.
indices = np.random.choice(df.index, size=int(percentage*num_rows), replace=False)
print(indices)

# Create a new DataFrame df_subset by selecting the rows of df at the randomly chosen indices.
df_subset = df.loc[indices]

# Print the number of rows in the subset.
print(f"Number of rows: {df_subset.shape[0]}")

9768
[ 2739 19082  8802 ...  4005 23981 18856]
Number of rows: 9768


### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [7]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testSubset

try:
    p, err = testSubset(df, df_subset)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


Note that you could write some of the code in the cell above in a single line, without creating a new array `indices`, which you likely won't use again. The cell below does exactly that.

In [44]:
percentage = 0.3
num_rows = df.shape[0] 

df_subset = df.loc[np.random.choice(df.index, size=int(percentage*num_rows), replace=False)]

This compressed style may seem dense and intimidating at first, but it becomes easier to read as you gain experience.

Let's check that our sampling worked. You should expect the shape of the new object `df_subset` to show 30% of the original number of rows:

In [45]:
print(df.shape) #original number of rows
print(df_subset.shape) #30% of the number of rows


(32560, 15)
(9768, 15)


But did you actually select the rows randomly? Look at the indices in the new DataFrame:

In [46]:
df_subset.head()

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
5791,45,Self-emp-inc,170871,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,15024,0,55,United-States,>50K
19523,21,Private,216181,Some-college,10,Never-married,Sales,Own-child,White,Male,0,0,35,United-States,<=50K
28219,70,Private,237065,5th-6th,3,Widowed,Other-service,Other-relative,White,Female,2346,0,40,?,<=50K
19265,53,Private,133436,7th-8th,4,Divorced,Machine-op-inspct,Not-in-family,White,Female,0,0,40,United-States,<=50K
27926,43,Private,257780,HS-grad,9,Married-civ-spouse,Adm-clerical,Husband,White,Male,0,0,40,United-States,<=50K


It seems random. To convince yourself that it is, try running the sampling code above again, then re-run the `head()` method above and inspect the results. You should see a different random sample each time you re-run the sampling code cell.
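If you ever need the opposite behavior, that is, the same "random" sample on every run, one option is to draw the indices from a seeded NumPy generator. The sketch below assumes a small made-up DataFrame named `toy` and an arbitrary seed of 42; neither comes from the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for df.
toy = pd.DataFrame({"value": range(10)})

# A Generator created with a fixed seed yields the same draw on every run.
rng = np.random.default_rng(seed=42)
indices = rng.choice(toy.index.to_numpy(), size=3, replace=False)
toy_subset = toy.loc[indices]

# Re-creating the generator with the same seed reproduces the same indices.
rng_again = np.random.default_rng(seed=42)
indices_again = rng_again.choice(toy.index.to_numpy(), size=3, replace=False)
print(np.array_equal(indices, indices_again))  # prints: True
```

Leaving the seed out restores the behavior described above, where every run gives a different sample.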

We will now see how to perform the same sampling the Pandas way:

In [48]:
percentage = 0.3
num_rows = df.shape[0] 

df_subset = df.sample(int(percentage*num_rows))
df_subset.head()

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
13125,37,Self-emp-not-inc,192251,10th,6,Married-civ-spouse,Other-service,Wife,White,Female,2635,0,40,United-States,<=50K
14327,43,Self-emp-not-inc,33521,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Male,0,0,70,United-States,>50K
24829,49,Local-gov,31267,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
30044,36,Private,117381,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,<=50K
22316,40,Private,79586,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,99999,0,40,?,>50K
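The `sample()` method can also take the fraction directly through its `frac` parameter, and a `random_state` value makes the draw repeatable. A minimal sketch, assuming a made-up DataFrame `toy` rather than the census data:

```python
import pandas as pd

# Hypothetical toy data standing in for df.
toy = pd.DataFrame({"value": range(10)})

# frac=0.3 asks for 30% of the rows; random_state fixes the draw.
sample_a = toy.sample(frac=0.3, random_state=0)
sample_b = toy.sample(frac=0.3, random_state=0)

print(len(sample_a))              # prints: 3
print(sample_a.equals(sample_b))  # prints: True
```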


## Step 3:  Filter a DataFrame by Column Values

Imagine that you want to examine only the private sector employees in DataFrame `df`. The cell below contains a conditional statement `df['workclass'] == 'Private'`.

This evaluates to a collection of True/False values, one per row. A value of True indicates that the corresponding row fulfills the condition. This collection of True/False values is a Pandas Series (a one-dimensional array). The Series is assigned to the variable `condition`.

Run the cell below and inspect the results.

In [66]:
df.columns = df.columns.str.strip()
print(df.columns)
condition = df['Male'] ==' Male' 
condition

Index(['39', 'State-gov', '77516', 'Bachelors', '13', 'Never-married',
       'Adm-clerical', 'Not-in-family', 'White', 'Male', '2174', '0', '40',
       'United-States', '<=50K'],
      dtype='object')


0         True
1         True
2         True
3        False
4        False
         ...  
32555    False
32556     True
32557    False
32558     True
32559    False
Name: Male, Length: 32560, dtype: bool
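To see what such a boolean Series looks like on a scale small enough to check by eye, here is a minimal sketch. The DataFrame `toy` and its three rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with a column to compare against.
toy = pd.DataFrame({
    "workclass": ["Private", "Local-gov", "Private"],
    "age": [38, 45, 28],
})

# Comparing a column to a value yields a boolean Series, one entry per row.
mask = toy["workclass"] == "Private"
print(mask.tolist())  # prints: [True, False, True]

# Summing a boolean Series counts the True entries.
print(mask.sum())     # prints: 2
```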

In the code cell below, use the `condition` variable to extract the private sector employee data from DataFrame `df`. Hint: Index into `df` using bracket notation and supply it the variable `condition`. Save the results to variable `df_private`. Use the `head()` method to inspect the new DataFrame `df_private`.

### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [68]:
df_private = df[condition]
df_private.shape[0]

21789

### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [16]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testPrivate

try:
    p, err = testPrivate(df, df_private, condition)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


How many rows are in the new DataFrame `df_private`?

In the cell below, display the number of rows in DataFrame `df_private` using the `shape` property. Save the results to variable `num_rows` and print `num_rows`. Hint: Recall that the `shape` property returns a tuple, with the first value corresponding to the number of rows and the second value corresponding to the number of columns.

### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [69]:
num_rows = df_private.shape[0]
print(num_rows)

21789


### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [18]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testRows

try:
    p, err = testRows(num_rows)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


## Step 4: Data Analysis Using Filtering

The code cell below finds the average age of people who self-reported as female in DataFrame `df`.

In [75]:
condition = df['40'] >=40
df[condition]['40'].mean()

np.float64(45.14780013711336)

Notice that here we do not create a new DataFrame for the filtered data. Instead, we perform the computation and display the result. If you do not anticipate working further with a subset of your DataFrame (e.g., querying it or finding more summary statistics about females), then you don't need to save your results to a new DataFrame object.
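The filter-then-aggregate pattern can be written in two equivalent ways. The sketch below, on a made-up DataFrame `toy` with assumed `sex` and `age` columns, compares chained bracket indexing with a single `loc[rows, column]` selection.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male"],
    "age": [30, 40, 50, 60],
})

condition = toy["sex"] == "Female"

# Chained indexing: filter the rows, then pick a column.
mean_chained = toy[condition]["age"].mean()

# Equivalent single-step selection with loc[rows, column].
mean_loc = toy.loc[condition, "age"].mean()

print(mean_chained, mean_loc)  # prints: 40.0 40.0
```

Both forms are fine for reading values. The `loc` form is generally the safer habit once you start assigning to a filtered selection.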

For practice, use the code cell below to experiment with the statement that computes the mean, `df[condition]['age'].mean()`.

In particular:
- Write `df[condition]` in the cell. Run the cell and inspect the results.
- Next, write `df[condition]['age']` in the cell. Run the cell and inspect the resulting Series.
- Next, write `df[condition]['age'].mean()` in the cell. Run the cell and inspect the results.
______

In [22]:
print(df[condition])
print(df[condition]['age'])
print(df[condition]['age'].mean())




      age workclass  fnlwgt     education  education-num      marital-status  \
2      21   Private  270043  Some-college             10       Never-married   
3      45   Private  168837  Some-college             10  Married-civ-spouse   
8      20       NaN  193416  Some-college             10       Never-married   
10     54   Private  155408       HS-grad              9             Widowed   
17     41   Private   56795       Masters             14  Married-civ-spouse   
...   ...       ...     ...           ...            ...                 ...   
6980   57   Private  176079       Masters             14            Divorced   
6985   44   Private  129100          11th              7           Separated   
6986   27   Private  230959     Bachelors             13       Never-married   
6991   17       NaN   80077          11th              7       Never-married   
6996   19   Private  349620          10th              6       Never-married   

             occupation   relationship 

Next you want to know how many people work for the local government for more than 40 hours per week. Using the code above as a guide, in the code cell below:
1. Define the conditions that will find the appropriate data from DataFrame `df`.
2. Apply the condition to DataFrame `df`.
3. Use the `shape` property to obtain the number of rows and assign the results to variable `rows`.

Follow these steps:

1. Create the first condition and name it `condition1`. `condition1` will select the people who work for the local government. Employment information is found in the column `workclass`. The value is `Local-gov`.

2. Create the second condition and name it `condition2`. `condition2` will check whether the number of hours worked per week is more than 40 hours. The number of hours worked can be found in the column `hours-per-week`.

3. Combine these two conditions using the `&` operator to create a compound statement. Assign that to variable `condition` (`condition = condition1 & condition2`).

4. Apply `condition` to DataFrame `df` using bracket notation, and save the result to DataFrame `df_local`.

5. Use the `shape` property to obtain the number of rows in `df_local`. Assign the result to variable `rows`.
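Compound conditions follow the same shape regardless of the columns involved. As a rough illustration on unrelated, made-up data (the `toy` DataFrame and its `education` and `age` values are assumptions), `&` combines two boolean Series with a logical AND and `|` with a logical OR:

```python
import pandas as pd

# Hypothetical toy data, deliberately different from the exercise columns.
toy = pd.DataFrame({
    "education": ["Masters", "HS-grad", "Masters", "Bachelors"],
    "age": [45, 52, 29, 41],
})

# Wrap each comparison in parentheses: & and | bind more tightly than == or >.
both = (toy["education"] == "Masters") & (toy["age"] > 40)    # logical AND
either = (toy["education"] == "Masters") | (toy["age"] > 40)  # logical OR

print(toy[both].shape[0])    # prints: 1
print(toy[either].shape[0])  # prints: 4
```

When each comparison is first stored in its own variable, as in the steps above, the parentheses are not needed.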

### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [23]:
condition1 = df['workclass'] == 'Local-gov'
condition2 = df['hours-per-week'] > 40
condition = condition1 & condition2
df_local = df[condition]
rows = df_local.shape[0]

### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [24]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testCondition

try:
    p, err = testCondition(df, condition1, condition2, condition, df_local, rows)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


Sometimes your data may contain missing values. One column that contains missing values in DataFrame `df` is `native-country`: not everyone's native country has been supplied. Missing entries in such columns appear as `NaN`.

The code cell below randomly samples 50% of the rows for which the native country information is available, ignoring rows with missing values. It uses the Pandas `notnull()` method. You can read more about `notnull()` in the online [documentation](https://pandas.pydata.org/docs/reference/api/pandas.notnull.html).

In [25]:
percentage = 0.5

# obtain all rows in which the column 'native-country' contains a value
df_country_notnull = df[df['native-country'].notnull()]

# obtain the number of rows in df_country_notnull
num_rows = df_country_notnull.shape[0]

# obtain a 50% random sample of rows from df_country_notnull and save the indices of these rows
indices = np.random.choice(df_country_notnull.index, size=int(percentage*num_rows), replace=False)

# using the row indices, save these row values to new DataFrame df_filtered
df_filtered = df_country_notnull.loc[indices]
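To see what `notnull()` returns before it is used as a filter, here is a minimal sketch on a made-up four-row DataFrame. The name `toy` and its values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing native-country entries.
toy = pd.DataFrame({
    "native-country": ["Cuba", np.nan, "Jamaica", None],
    "age": [28, 37, 49, 52],
})

# notnull() is True where a value is present and False for NaN or None.
has_country = toy["native-country"].notnull()
print(has_country.tolist())  # prints: [True, False, True, False]

# Keep only the rows with a known native country.
toy_known = toy[has_country]
print(toy_known.shape[0])    # prints: 2
```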


In the code cell below, find the mean age of individuals in DataFrame `df_filtered` and save the value to variable `mean_age`.

### Graded Cell

The cell below will be graded. Remove the line "raise NotImplementedError()" before writing your code.

In [26]:
mean_age = df_filtered['age'].mean()


### Self-Check

Run the cell below to test the correctness of your code above before submitting for grading. Do not add code or delete code in the cell.

In [27]:
# Run this self-test cell to check your code; 
# do not add code or delete code in this cell
from jn import testMean
try:
    p, err = testMean(df_filtered, mean_age)
    print(err)
except Exception as e:
    print("Error!\n" + str(e))
    


Correct!


You have been selecting a single column (e.g., 'age') by using bracket notation `df_filtered['age']`. You will sometimes also encounter columns being selected using dot notation `df_filtered.age`. Note that dot notation won't work if the column name contains hyphens, spaces, or other characters that are not valid in a Python identifier, or if it clashes with an existing DataFrame attribute or method. We will stick to providing names as strings in square brackets.
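The difference between the two notations is easy to check on a small example. The sketch below assumes a made-up DataFrame `toy` with one simple and one hyphenated column name.

```python
import pandas as pd

# Hypothetical toy data with one simple and one hyphenated column name.
toy = pd.DataFrame({"age": [28, 37], "hours-per-week": [40, 50]})

# Both forms return the same Series for a simple column name.
print(toy["age"].equals(toy.age))      # prints: True

# A hyphenated name only works with brackets: toy.hours-per-week would be
# parsed as toy.hours - per - week and fail.
print(toy["hours-per-week"].tolist())  # prints: [40, 50]
```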