In [1]:
import numpy as np
import pandas as pd

## Selecting Data from `DataFrame` Objects

As we found with `Series` objects, you can interact with `DataFrame` objects in ways that sometimes resemble a dictionary and other times a NumPy array.

In [2]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1',
    index_col = 'institution_name')
college_scorecard.head()

institution_name,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,


### Masking

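Masking operations likewise return rows from a `DataFrame`, but the **mask is built from a comparison on one of the columns (a `Series`)**. The comparison produces a Boolean `Series`, and indexing with it keeps only the rows where the value is `True`. This sounds more confusing than it is, so let's demonstrate: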

In [5]:
mask_ak = college_scorecard['state'] == 'AK'
mask_ak

institution_name
Alaska Bible College                       True
Alaska Career College                      True
Alaska Christian College                   True
Alaska Pacific University                  True
AVTEC-Alaska's Institute of Technology     True
                                          ...  
Northwest College                         False
Sheridan College                          False
University of Wyoming                     False
Western Wyoming Community College         False
Wyotech-Laramie                           False
Name: state, Length: 7282, dtype: bool

In [6]:
# Return all rows where the 'state' Series has a value of 'AK'
college_scorecard[ mask_ak]

institution_name,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
Alaska Bible College,102580,884300,8843,Palmer,AK,www.akbible.edu/,3,Bachelors,2,PrivateNonProfit,...,0.3571,0.3333,,,,0.2857,,PrivacySuppressed,,
Alaska Career College,103501,2541000,25410,Anchorage,AK,www.alaskacareercollege.edu,1,Certificate,3,PrivateForProfit,...,0.7078,,0.7941,,,0.786,28700.0,8994,0.707589494,
Alaska Christian College,442523,4138600,41386,Soldotna,AK,www.alaskacc.edu,1,Certificate,2,PrivateNonProfit,...,0.8868,,0.4737,,1.0,0.6792,,PrivacySuppressed,0.0,
Alaska Pacific University,102669,106100,1061,Anchorage,AK,www.alaskapacific.edu,3,Bachelors,2,PrivateNonProfit,...,0.3152,0.7742,,1.0,,0.5297,47000.0,23250,,0.514833663
AVTEC-Alaska's Institute of Technology,102711,3160300,31603,Seward,AK,www.avtec.edu/,1,Certificate,1,Public,...,0.0737,,1.0,,1.0,0.0664,33500.0,PrivacySuppressed,0.846055789,
Charter College-Anchorage,102845,2576900,25769,Anchorage,AK,www.chartercollege.edu,1,Certificate,3,PrivateForProfit,...,0.8307,,,,,0.7503,39200.0,13875,,0.400148336
Ilisagvik College,434584,3461300,34613,Barrow,AK,www.ilisagvik.edu,1,Certificate,1,Public,...,0.1323,,0.8095,,0.3333,0.0,24900.0,PrivacySuppressed,0.340906818,
University of Alaska Anchorage,102553,1146200,11462,Anchorage,AK,www.uaa.alaska.edu,3,Bachelors,1,Public,...,0.2385,0.7164,,0.4549,,0.2647,42500.0,19449.5,,0.252541205
University of Alaska Fairbanks,102614,106300,1063,Fairbanks,AK,www.uaf.edu,3,Bachelors,1,Public,...,0.2263,0.7756,,0.4857,,0.255,36200.0,19355,,0.315570823
University of Alaska Southeast,102632,106500,1065,Juneau,AK,www.uas.alaska.edu,1,Certificate,1,Public,...,0.1769,0.7167,,0.6364,,0.1996,37400.0,16875,,0.156750746


In [14]:
# Which colleges in IN offer Bachelors degrees?
# Again, notice the parentheses around each condition
# Also, notice that I'm assigning the result to a variable so that I can use it later
colleges_IN_Bachelors = college_scorecard[(college_scorecard['state'] == 'IN') & 
                                          (college_scorecard['predominant_degree_desc'] == 'Bachelors')]

**NOTE**: You can break the right-hand side of the assignment across two lines to make the code more readable.

In [15]:
colleges_IN_Bachelors

institution_name,UNITID,OPEID,OPEID6,city,state,url,predominant_degree_code,predominant_degree_desc,institutional_owner_code,institutional_owner_desc,...,pell_grant_receipents,full_time_retention_rate_4_year,full_time_retention_rate_less_than_4_year,part_time_rentention_rate_4_year,part_time_rentention_rate_less_than_4_year,students_with_federal_loans,median_student_earnings,median_student_debt,less_than_4_year_school_completion_rate,4_year_school_completion_rate
Anderson University,150066,178500,1785,Anderson,IN,www.anderson.edu,3,Bachelors,2,PrivateNonProfit,...,0.2118,0.7465,,0.5,,0.4688,35600.0,27000,,0.599595878
Ball State University,150136,178600,1786,Muncie,IN,www.bsu.edu,3,Bachelors,1,Public,...,0.3399,0.8141,,0.5,,0.5917,38800.0,25000,,0.594160686
Bethel College-Indiana,150145,178700,1787,Mishawaka,IN,www.bethelcollege.edu,3,Bachelors,2,PrivateNonProfit,...,0.5106,0.8122,,0.0,,0.7631,34900.0,PrivacySuppressed,,0.672081466
Butler University,150163,178800,1788,Indianapolis,IN,www.butler.edu,3,Bachelors,2,PrivateNonProfit,...,0.1649,0.9014,,0.0,,0.5742,55000.0,27000,,0.757950963
Calumet College of Saint Joseph,150172,183400,1834,Whiting,IN,www.ccsj.edu,3,Bachelors,2,PrivateNonProfit,...,0.4351,0.5772,,0.3684,,0.568,38900.0,20293.5,,0.274751351
Chamberlain College of Nursing-Indiana,475741,638510,6385,Indianapolis,IN,www.chamberlain.edu,3,Bachelors,3,PrivateForProfit,...,0.4843,0.75,,,,0.8931,52600.0,24581,,
DePauw University,150400,179200,1792,Greencastle,IN,www.depauw.edu,3,Bachelors,2,PrivateNonProfit,...,0.1944,0.9267,,,,0.5551,47700.0,25000,,0.796709494
DeVry University-Indiana,482486,1072747,10727,Merrillville,IN,www.devry.edu,3,Bachelors,3,PrivateForProfit,...,0.5817,0.7778,,0.4167,,0.8213,,40150,,PrivacySuppressed
Earlham College,150455,179300,1793,Richmond,IN,www.earlham.edu,3,Bachelors,2,PrivateNonProfit,...,0.2829,0.8419,,,,0.5414,33400.0,26840,,0.712695987
Franklin College,150604,179800,1798,Franklin,IN,www.franklincollege.edu,3,Bachelors,2,PrivateNonProfit,...,0.3895,0.7941,,,,0.7722,40800.0,27000,,0.590159016


In [16]:
colleges_IN_Bachelors.shape[0]

50
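The reason for the parentheses is operator precedence: in Python, `&` binds more tightly than `==`, so without parentheses the expression is grouped differently from what you intended and typically raises an error. One way to sidestep this is to give each condition its own named mask and then combine the masks. The sketch below uses a small made-up `DataFrame` (the `toy` data and college names are hypothetical, not taken from the scorecard file):

```python
import pandas as pd

# Hypothetical miniature of the scorecard data, for illustration only
toy = pd.DataFrame(
    {
        'state': ['IN', 'IN', 'AK', 'IN'],
        'predominant_degree_desc': ['Bachelors', 'Certificate', 'Bachelors', 'Bachelors'],
    },
    index=['College A', 'College B', 'College C', 'College D'],
)

# Build each condition as its own named Boolean mask
mask_in = toy['state'] == 'IN'
mask_bachelors = toy['predominant_degree_desc'] == 'Bachelors'

# & is element-wise AND, | is element-wise OR, ~ is element-wise NOT
in_and_bachelors = toy[mask_in & mask_bachelors]
in_or_bachelors = toy[mask_in | mask_bachelors]
not_in = toy[~mask_in]

print(in_and_bachelors.index.tolist())   # ['College A', 'College D']
print(len(in_or_bachelors), len(not_in)) # 4 1
```

Named masks also make it easy to reuse a condition in several selections.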

### Selecting Multiple Columns of a `DataFrame`

In [17]:
two_columns = college_scorecard[ ['state', 'predominant_degree_desc'] ]
two_columns.head()

institution_name,state,predominant_degree_desc
Alaska Bible College,AK,Bachelors
Alaska Career College,AK,Certificate
Alaska Christian College,AK,Certificate
Alaska Pacific University,AK,Bachelors
AVTEC-Alaska's Institute of Technology,AK,Certificate


**NOTE**: Of the two sets of square brackets `[[ ]]`, the outer set is the indexing operator that selects from the `DataFrame`, and the inner set is a Python list naming the columns you want to select.
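A related detail worth seeing once: passing a single column label returns a `Series`, while passing a list (even a one-item list) returns a `DataFrame`. The following is a minimal sketch on a made-up two-row `DataFrame` (the `toy` data is hypothetical):

```python
import pandas as pd

# Hypothetical two-row table, for illustration only
toy = pd.DataFrame({'state': ['AK', 'IN'], 'city': ['Palmer', 'Muncie']})

one_column_series = toy['state']    # a single label returns a Series
one_column_frame = toy[['state']]   # a list of labels returns a DataFrame

print(type(one_column_series).__name__)  # Series
print(type(one_column_frame).__name__)   # DataFrame
print(one_column_frame.shape)            # (2, 1)
```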

In [19]:
list_of_cols = ['state', 'predominant_degree_desc', 'city']
# The list now names three columns, so use a variable name that reflects that
three_columns = college_scorecard[ list_of_cols ]
three_columns.head()

institution_name,state,predominant_degree_desc,city
Alaska Bible College,AK,Bachelors,Palmer
Alaska Career College,AK,Certificate,Anchorage
Alaska Christian College,AK,Certificate,Soldotna
Alaska Pacific University,AK,Bachelors,Anchorage
AVTEC-Alaska's Institute of Technology,AK,Certificate,Seward


## Activity On Football Athletes Data

1. Select the players who are in the freshman class and assign the result to a variable. How many such players are there?
1. Select the players whose position is wide receiver (WR) and who are in the junior class, and assign the result to a variable. How many such players are there?
1. Find the average height of players whose position is wide receiver (WR) and who are in the junior class.
1. Select only two columns, the height and weight of the players, and then display only the players whose weight is below 185 lbs.

In [20]:
athletes_data = pd.read_csv('./data/nd-football-2021-roster.csv', index_col=['Name'])
athletes_data.head()

Name,Number,Position,Height,Weight,Class,Hometown
Tyler Buchner,12,QB,73,207,FR,"San Diego, CA"
Cole Capen,8,QB,76,227,JR,"Yorba Linda, CA"
Brendon Clark,7,QB,73,225,SO,"Midlothian, VA"
Jack Coan,17,QB,75,223,SR,"Sayville, NY"
Ron Powlus III,11,QB,74,215,FR,"Granger, IN"


In [22]:
mask_fr = athletes_data['Class'] == 'FR'
mask_fr

Name
Tyler Buchner      True
Cole Capen        False
Brendon Clark     False
Jack Coan         False
Ron Powlus III     True
                  ...  
Jay Bramblett     False
Jake Rittman      False
Alex Peitsch       True
Axel Raarup       False
Michael Vinson    False
Name: Class, Length: 113, dtype: bool

In [24]:
athletes_fr = athletes_data[mask_fr]
athletes_fr.shape[0]

45

In [25]:
athletes_data[ (athletes_data['Class'] == 'FR') ].shape[0]

45

In [28]:
mask_wr_jr = (athletes_data['Position'] == 'WR' ) & (athletes_data['Class'] == 'JR')
athletes_wr_jr = athletes_data[mask_wr_jr]
athletes_wr_jr.shape[0]

6

In [29]:
athletes_wr_jr

Name,Number,Position,Height,Weight,Class,Hometown
Avery Davis,3,WR,71,202,JR,"Cedar Hill, TX"
Lawrence Keys III,13,WR,70,176,JR,"New Orleans, LA"
Braden Lenzy,0,WR,71,181,JR,"Tigard, OR"
Greg Mailey,43,WR,73,203,JR,"Hudson, OH"
Matt Salerno,29,WR,72,199,JR,"Valencia, CA"
Joe Wilkins Jr.,5,WR,73,195,JR,"North Fort Myers, FL"


In [31]:
# Pandas way of computing the mean
athletes_wr_jr['Height'].mean()

71.66666666666667

In [34]:
list_of_cols = ['Height', 'Weight']
athletes_h_w = athletes_data[list_of_cols]
mask_185 = athletes_h_w['Weight'] <185 
athletes_h_w[ mask_185 ]

Name,Height,Weight
Henry Cook,70,182
Lawrence Keys III,70,176
Braden Lenzy,71,181
Jack Polian,72,174
Conor Ratigan,71,179
Ryan Barnes,73,182
TaRiq Bracy,70,177
Chance Tucker,71,183
Chris Salerno,71,183


In [35]:
## All in one line of code
athletes_data[['Height','Weight']][(athletes_data['Weight']<185)]

Name,Height,Weight
Henry Cook,70,182
Lawrence Keys III,70,176
Braden Lenzy,71,181
Jack Polian,72,174
Conor Ratigan,71,179
Ryan Barnes,73,182
TaRiq Bracy,70,177
Chance Tucker,71,183
Chris Salerno,71,183
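As an alternative to chaining two sets of brackets, pandas also provides the `.loc` indexer, which takes the row selection and the column selection together. The sketch below uses a made-up three-player roster (the `roster` data and player names are hypothetical):

```python
import pandas as pd

# Hypothetical roster, for illustration only
roster = pd.DataFrame(
    {'Position': ['WR', 'QB', 'WR'], 'Height': [70, 73, 71], 'Weight': [176, 207, 181]},
    index=['Player A', 'Player B', 'Player C'],
)

# .loc[row_mask, list_of_columns] filters rows and picks columns in one step
light_players = roster.loc[roster['Weight'] < 185, ['Height', 'Weight']]
print(light_players)
```

Doing both selections in a single `.loc` call avoids creating an intermediate `DataFrame`, which also matters if you later want to assign values to the selected cells.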


# Handling Missing Data

In [36]:
val1 = None

val1 is None

True

In [37]:
val1*5

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [38]:
vals1 = np.array([1,None, 3, 4])
vals1*5

TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

In [39]:
vals1

array([1, None, 3, 4], dtype=object)

In [40]:
np.sum(vals1)

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

### NaN: Missing numerical data

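NaN stands for Not-a-Number. It is a special floating-point value, so a NumPy array that contains it gets a floating-point dtype.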

In [41]:
vals1 = np.array([1,np.nan, 3, 4])
vals1*5

array([ 5., nan, 15., 20.])

In [42]:
vals1

array([ 1., nan,  3.,  4.])

In [43]:
vals1.dtype

dtype('float64')

In [44]:
np.sum(vals1)

nan

**The sum of any real number and NaN is NaN.**
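NaN propagates through other arithmetic and aggregations in the same way, and it does not compare equal to anything, including itself. That is why detection is done with `np.isnan` rather than with `==`. A minimal sketch (the variable names here are illustrative):

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

sum_with_nan = 1 + np.nan          # arithmetic with NaN gives NaN
nan_equals_itself = np.nan == np.nan
nan_positions = np.isnan(vals)     # element-wise test for NaN
max_with_nan = vals.max()          # aggregations propagate NaN as well

print(sum_with_nan)        # nan
print(nan_equals_itself)   # False
print(nan_positions)
print(max_with_nan)        # nan
```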

### np.nansum

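`np.nansum` ignores NaN values (in effect treating them as zero) when adding the elements of an array. NumPy offers similar NaN-aware versions of other aggregations, such as `np.nanmedian`.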

In [45]:
np.nansum(vals1)

8.0

In [46]:
np.nanmedian(vals1)

3.0

### NaN and None in pandas

When the data is numeric, pandas treats `None` and NaN interchangeably as missing values and stores both as NaN.

In [47]:
simple_series = pd.Series([1,np.nan, 2, None])

In [48]:
simple_series

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

In [49]:
simple_series.sum()

3.0
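How a missing value is stored depends on the dtype of the `Series`. The sketch below compares three cases: a numeric `Series`, a mixed-type `Series`, and pandas' nullable integer dtype `'Int64'` (the variable names are illustrative). In all three, `isnull()` still finds the missing entries:

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1, np.nan, 2, None])           # float64: None is stored as NaN
mixed = pd.Series([1, 'Hello', None])               # object: holds arbitrary Python objects
nullable = pd.Series([1, None, 3], dtype='Int64')   # nullable integer: missing shown as <NA>

print(numeric.dtype, mixed.dtype, nullable.dtype)

# isnull() detects the missing entries regardless of how they are stored
print(numeric.isnull().sum(), mixed.isnull().sum(), nullable.isnull().sum())  # 2 1 1
```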

## Operating on Null Values

The following methods help detect and handle null values in pandas:

| Method         | Description                                                        |
|----------------|--------------------------------------------------------------------|
| ``isnull()``   | Generate a Boolean mask indicating missing values                  |
| ``notnull()``  | Opposite of ``isnull()``                                           |
| ``dropna()``   | Return a filtered version of the data with missing values removed  |
| ``fillna()``   | Return a copy of the data with missing values filled               |


In [50]:
simple_data = pd.Series([1,np.nan, 'Hello', None])
simple_data

0        1
1      NaN
2    Hello
3     None
dtype: object

In [51]:
simple_data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [52]:
~simple_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [53]:
simple_data[~simple_data.isnull()]

0        1
2    Hello
dtype: object

In [54]:
simple_data[simple_data.notnull()]

0        1
2    Hello
dtype: object

In [55]:
simple_data.dropna()

0        1
2    Hello
dtype: object

In [56]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,    6]])

df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [57]:
df.dropna()

Unnamed: 0,0,1,2
1,2.0,3.0,5


In [58]:
df.dropna(axis='columns')

Unnamed: 0,2
0,2
1,5
2,6


<div class="alert alert-block alert-info">
<p>
The ``dropna()`` method on a DataFrame offers other optional parameters, such as ``how`` and ``thresh``. **See page 126 of the textbook for more details.** </p>
</div> 
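To give a feel for these two parameters, here is a minimal sketch on a made-up `DataFrame` (`df_demo` is hypothetical and separate from the `df` used above). `how='all'` drops a row only when every value in it is missing, and `thresh=k` keeps rows that have at least `k` non-missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
df_demo = pd.DataFrame([[1,      np.nan, 2],
                        [np.nan, np.nan, np.nan],
                        [np.nan, np.nan, 6],
                        [3,      4,      5]])

drop_any = df_demo.dropna()            # default how='any': drop rows with any NaN
drop_all = df_demo.dropna(how='all')   # drop only rows that are entirely NaN
keep_two = df_demo.dropna(thresh=2)    # keep rows with at least 2 non-NaN values

print(drop_any.index.tolist())   # [3]
print(drop_all.index.tolist())   # [0, 2, 3]
print(keep_two.index.tolist())   # [0, 3]
```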

In [59]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [60]:
df.fillna(100)

Unnamed: 0,0,1,2
0,1.0,100.0,2
1,2.0,3.0,5
2,100.0,4.0,6


In [61]:
df

Unnamed: 0,0,1,2
0,1.0,,2
1,2.0,3.0,5
2,,4.0,6


In [62]:
df = df.fillna(100)

In [63]:
df

Unnamed: 0,0,1,2
0,1.0,100.0,2
1,2.0,3.0,5
2,100.0,4.0,6


<div class="alert alert-block alert-info">
<p>
The ``fillna()`` method on a DataFrame also offers an optional parameter called ``method``, with values such as ``method='ffill'`` and ``method='bfill'``. **See page 127 of the textbook for more details.** </p>

<p>
**Also, read about another important keyword argument, ``inplace``. What happens when it is set to `False` and when it is set to `True`?**
</p>
</div> 
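The sketch below illustrates forward and backward filling, plus the effect of `inplace`. It uses the dedicated `ffill()` and `bfill()` methods, which recent pandas releases favor over passing `method=` to `fillna()`; treat the exact behavior of `method=` in your installed version as something to check. The `Series` `s` is made up for illustration:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, np.nan, np.nan, 4])

forward = s.ffill()    # carry the last valid value forward
backward = s.bfill()   # pull the next valid value backward

print(forward.tolist())    # [1.0, 1.0, 1.0, 4.0]
print(backward.tolist())   # [1.0, 4.0, 4.0, 4.0]

# By default (inplace=False), fillna returns a new object and leaves s unchanged
filled_copy = s.fillna(0)
missing_before = int(s.isnull().sum())

# With inplace=True, s itself is modified
s.fillna(0, inplace=True)
missing_after = int(s.isnull().sum())

print(missing_before, missing_after)   # 2 0
```

Reassigning the result, as in `df = df.fillna(100)` above, achieves the same end without `inplace`.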

## Working with a dataset with missing values

Marketing dataset: This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay area. The goal is to predict the Annual Income of Household from the other 13 demographic attributes. [Source](http://sci2s.ugr.es/keel/dataset.php?cod=163)

[Data Dictionary](http://sci2s.ugr.es/keel/dataset/data/classification/marketing-names.txt)

In [64]:
mark_data = pd.read_csv('./data/marketing.csv')

In [65]:
mark_data.head()

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
0,2,1.0,5,4.0,5.0,5.0,3,3.0,0,1.0,1.0,7.0,,9
1,1,1.0,5,5.0,5.0,5.0,3,5.0,2,1.0,1.0,7.0,1.0,9
2,2,1.0,3,5.0,1.0,5.0,2,3.0,1,2.0,3.0,7.0,1.0,9
3,2,5.0,1,2.0,6.0,5.0,1,4.0,2,3.0,1.0,7.0,1.0,1
4,2,5.0,1,2.0,6.0,3.0,1,4.0,2,3.0,1.0,7.0,1.0,1


In [70]:
mark_data.sample(5)

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
329,2,5.0,2,4.0,5.0,5.0,1,5.0,0,3.0,1.0,7.0,1.0,8
4708,1,5.0,2,2.0,6.0,3.0,1,4.0,3,3.0,,7.0,,1
4575,2,5.0,2,4.0,2.0,5.0,1,3.0,1,3.0,3.0,7.0,1.0,4
7907,2,1.0,4,4.0,5.0,5.0,3,4.0,2,1.0,1.0,7.0,,8
2020,2,5.0,3,3.0,4.0,,1,3.0,2,2.0,3.0,3.0,1.0,2


### Activity:

* How many total respondents are in the dataset?


In [71]:
mark_data.shape

(8993, 14)

In [72]:
mark_data.shape[0]

8993

In [73]:
len(mark_data)

8993


* How many missing values are in the following two columns? 
  * Age column
  * MaritalStatus column


In [78]:
mark_data['Age'].isnull().sum()

0

In [79]:
mark_data['MaritalStatus'].isnull().sum()

160

**NOTE**: In the scenario above, we analyzed one column at a time. Below, we see how to work with all columns at the same time.

* How many missing values are there in each of the columns of the dataset? 

In [80]:
mark_data.isnull()

Unnamed: 0,Sex,MaritalStatus,Age,Education,Occupation,YearsInSf,DualIncome,HouseholdMembers,Under18,HouseholdStatus,TypeOfHome,EthnicClass,Language,Income
0,False,False,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8988,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8989,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8990,False,False,False,False,False,False,False,False,False,False,False,False,False,False
8991,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [81]:
mark_data.isnull().sum()

Sex                   0
MaritalStatus       160
Age                   0
Education            86
Occupation          136
YearsInSf           913
DualIncome            0
HouseholdMembers    375
Under18               0
HouseholdStatus     240
TypeOfHome          357
EthnicClass          68
Language            359
Income                0
dtype: int64


* What percentage of the values is missing in each column of the dataset?


In [84]:
perc_miss_per_col = 100*mark_data.isnull().mean()
perc_miss_per_col

Sex                  0.000000
MaritalStatus        1.779162
Age                  0.000000
Education            0.956299
Occupation           1.512287
YearsInSf           10.152341
DualIncome           0.000000
HouseholdMembers     4.169910
Under18              0.000000
HouseholdStatus      2.668742
TypeOfHome           3.969754
EthnicClass          0.756144
Language             3.991994
Income               0.000000
dtype: float64
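Taking the `mean()` of the Boolean mask works because `True` counts as 1 and `False` as 0, so the mean of each column is the fraction of missing entries. A minimal sketch on a made-up four-row table (the `toy` data is hypothetical) shows that this matches dividing the counts by the number of rows:

```python
import numpy as np
import pandas as pd

# Hypothetical data, for illustration only
toy = pd.DataFrame({'Age': [5, 3, 1, 2],
                    'Language': [np.nan, 1.0, 1.0, np.nan]})

missing_mask = toy.isnull()
count_missing = missing_mask.sum()              # number of missing values per column
perc_via_mean = 100 * missing_mask.mean()       # fraction of True values, as a percentage
perc_via_count = 100 * count_missing / len(toy) # same result from the counts

print(perc_via_mean['Language'])   # 50.0
print(perc_via_count['Language'])  # 50.0
```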



* Which attribute has the most missing values in the dataset? (**Hint**: To get the index of the maximum element you can use [`idxmax()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.idxmax.html) function)



In [85]:
perc_miss_per_col.max()

10.152340709440676

In [86]:
perc_miss_per_col.idxmax()

'YearsInSf'


* How do you fill the missing values with a `0`? 



* **Most Common Use**: Can you fill each missing value with the corresponding average for that attribute? 
    * For example, if the 'Education' attribute is missing for a person, can you find the average 'Education' of all people and fill that missing 'Education' value with that average?
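One possible approach to the last two questions is sketched below on a small made-up table rather than the marketing file (the `survey` data is hypothetical). It relies on two pandas behaviors: `mean()` skips missing values by default, and `fillna()` accepts a `Series` of per-column fill values that it matches by column name:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data, for illustration only
survey = pd.DataFrame({
    'Education':  [4.0, np.nan, 2.0],
    'Occupation': [5.0, 1.0, np.nan],
})

# Fill every missing value with 0
filled_zero = survey.fillna(0)

# Fill each missing value with the mean of its own column
column_means = survey.mean()               # NaN values are skipped by default
filled_mean = survey.fillna(column_means)  # fill values are matched by column name

print(column_means['Education'])           # 3.0
print(filled_mean.loc[1, 'Education'])     # 3.0
```

The same pattern applied to the real data would be `mark_data.fillna(mark_data.mean())`, assuming every column is numeric.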