In [None]:
import numpy as np
import pandas as pd

## Selecting Data from `DataFrame` Objects

Similarly to what we found with `Series` objects, you can interact with `DataFrame` objects in ways that sometimes resemble a dictionary and other times a NumPy array.

In [None]:
college_scorecard = pd.read_csv(
    './data/college-scorecard-data-scrubbed.csv', 
    encoding='latin-1',
    index_col = 'institution_name')
college_scorecard.head()

### Masking

Masking operations likewise return rows from a `DataFrame`, but the **criteria of the masks will be a comparison on one of the columns/Series**. This sounds somewhat confusing, so let's just demonstrate:

In [None]:
college_scorecard['state'] == 'AK'

In [None]:
# Return all rows where the 'state' Series has a value of 'AK'
college_scorecard[ college_scorecard['state'] == 'AK']

In [None]:
# Which colleges in IN offer Bachelors degrees?
# Again, notice the parentheses here
# Also, notice that I'm assigning it to a variable so that I can use it later
colleges_IN_Bachelors = college_scorecard[(college_scorecard['state'] == 'IN') & 
                                          (college_scorecard['predominant_degree_desc'] == 'Bachelors')]

**NOTE**: You can break down the right hand side of the assignment into two lines for readability of the code. 

In [None]:
colleges_IN_Bachelors

In [None]:
colleges_IN_Bachelors.shape[0]
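The role of the parentheses can be sketched on a tiny, made-up `DataFrame` (the `toy` frame and its `degree` column are hypothetical, not part of the scorecard data). Each comparison is wrapped in parentheses because `&`, `|`, and `~` bind more tightly than `==` in Python:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'state': ['IN', 'IN', 'AK'],
    'degree': ['Bachelors', 'Associates', 'Bachelors'],
})

# AND: both conditions must hold for a row to be kept
both = (toy['state'] == 'IN') & (toy['degree'] == 'Bachelors')

# OR: at least one condition must hold
either = (toy['state'] == 'AK') | (toy['degree'] == 'Associates')

# NOT: invert a mask with ~
not_indiana = toy[~(toy['state'] == 'IN')]

print(both.tolist())
print(either.tolist())
print(not_indiana)
```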

### Selecting Multiple Columns of DataFrame

In [None]:
two_columns = college_scorecard[ ['state', 'predominant_degree_desc'] ]
two_columns.head()

**NOTE**: Of the two sets of square brackets `[[ ]]`, the outer set is the indexing operation that selects the columns, and the inner set is a Python list naming the columns you want to select.
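As a minimal sketch of why the brackets matter, the following compares single-label and list-based selection on a small, made-up `DataFrame` (the `toy` frame and its `enrollment` column are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'state': ['IN', 'AK', 'IN'],
    'predominant_degree_desc': ['Bachelors', 'Associates', 'Certificate'],
    'enrollment': [1200, 300, 450],
})

one_column = toy['state']                 # single label -> Series
one_column_df = toy[['state']]            # list with one label -> DataFrame
two_cols = toy[['state', 'enrollment']]   # list of labels -> DataFrame

print(type(one_column).__name__)
print(type(one_column_df).__name__)
print(two_cols.shape)
```

Passing a list, even a list with a single label, returns a `DataFrame` rather than a `Series`.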

## Activity On Football Athletes Data

1. Select the players who are in the freshman class and assign the result to a variable. How many such players are there?
1. Select the players whose position is wide receiver (WR) and who are in the senior class, and assign the result to a variable. How many such players are there?
1. Find the average height of players whose position is wide receiver (WR) and who are in the senior class.
1. Select only two columns, the height and weight of the players, and then display only the players whose weight is below 185 lbs.

In [None]:
athletes_data = pd.read_csv('./data/nd-football-2019-roster.csv', index_col=['Name'])
athletes_data.head()

# Handling Missing Data

In [None]:
val1 = None

val1 is None

In [None]:
# None does not support arithmetic, so this raises a TypeError
val1*5

In [None]:
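# Because of the None, NumPy stores this array with dtype=object
vals1 = np.array([1, None, 3, 4])
# Multiplying the None element raises a TypeError
vals1*5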

In [None]:
vals1

In [None]:
# Adding an integer and None also raises a TypeError
np.sum(vals1)

### NaN: Missing numerical data

NaN stands for Not-a-Number

In [None]:
vals1 = np.array([1,np.nan, 3, 4])
vals1*5

In [None]:
vals1

In [None]:
vals1.dtype

In [None]:
np.sum(vals1)

**The sum of any number and NaN is NaN.**

### np.nansum

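`np.nansum` treats NaN as zero when adding the elements of an array. NumPy provides NaN-aware versions of other aggregations as well, such as `np.nanmedian`, which ignores NaN values.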

In [None]:
np.nansum(vals1)

In [None]:
np.nanmedian(vals1)

### NaN and None in pandas

In numeric data, pandas treats both `NaN` and `None` as missing values and stores them as `NaN`.

In [None]:
simple_series = pd.Series([1,np.nan, 2, None])

In [None]:
simple_series

In [None]:
simple_series.sum()
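A small sketch of how the conversion depends on the data: with only numbers, the `Series` becomes floating point and `None` is stored as `NaN`, while mixing in a string yields an `object` dtype. The variable names `numeric` and `mixed` are illustrative only. In both cases `isnull()` flags the same positions as missing.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1, np.nan, 2, None])
mixed = pd.Series([1, np.nan, 'Hello', None])

print(numeric.dtype)             # numeric data is upcast to float
print(mixed.dtype)               # mixed data falls back to object
print(numeric.isnull().tolist())
print(mixed.isnull().tolist())

# Unlike np.sum on an array with NaN, Series.sum() skips missing values by default
print(numeric.sum())
```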

## Operating on Null Values

The following methods help detect and handle null values in pandas:

| Method for missing values | Description                                          |
|---------------------------|------------------------------------------------------|
| ``isnull()``              | Generate a Boolean mask indicating missing values    |
| ``notnull()``             | Opposite of ``isnull()``                             |
| ``dropna()``              | Return a filtered version of the data                |
| ``fillna()``              | Return a copy of the data with missing values filled |


In [None]:
simple_data = pd.Series([1,np.nan, 'Hello', None])
simple_data

In [None]:
simple_data.isnull()

In [None]:
~simple_data.isnull()

In [None]:
simple_data[~simple_data.isnull()]

In [None]:
simple_data[simple_data.notnull()]

In [None]:
simple_data.dropna()

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,    6]])

df

In [None]:
df.dropna()

In [None]:
df.dropna(axis='columns')

<div class="alert alert-block alert-info">
<p>
The ``dropna()`` function on a dataframe offers other optional parameters, such as ``how`` and ``thresh``. **Look at Page 126 of the textbook for more details.** </p>
</div> 
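As a rough sketch of what `how` and `thresh` do, the following uses a small, made-up `DataFrame` (`df_demo` is hypothetical and separate from `df` above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows have 2, 0, 1, and 3 non-null values respectively
df_demo = pd.DataFrame([[1,      np.nan, 2],
                        [np.nan, np.nan, np.nan],
                        [np.nan, np.nan, 6],
                        [7,      8,      9]])

# Default (how='any'): drop a row if ANY value in it is missing
any_dropped = df_demo.dropna()

# how='all': drop a row only if ALL of its values are missing
all_dropped = df_demo.dropna(how='all')

# thresh=2: keep only rows with at least 2 non-null values
thresh_dropped = df_demo.dropna(thresh=2)

print(any_dropped.index.tolist())
print(all_dropped.index.tolist())
print(thresh_dropped.index.tolist())
```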

In [None]:
df

In [None]:
df.fillna(100)

In [None]:
df

In [None]:
df = df.fillna(100)

In [None]:
df

<div class="alert alert-block alert-info">
<p>
The ``fillna()`` function on a dataframe also offers an optional parameter called ``method``, with values such as ``method='ffill'`` and ``method='bfill'``. Recent pandas versions also provide these fills as the ``ffill()`` and ``bfill()`` methods. **Look at Page 127 of the textbook for more details.** </p>

<p>
**Also, read about the other important keyword argument ``inplace``. What happens when it is set to `False` and `True`?**
</p>
</div> 
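A minimal sketch of forward fill, backward fill, and the effect of `inplace`, using a made-up `DataFrame` (`df_demo` and its column names are hypothetical). It uses the `ffill()` and `bfill()` methods, which perform the same fills as the `method` argument described above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration
df_demo = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                        'b': [np.nan, 5.0, np.nan]})

# Forward fill: propagate the last valid value downward
forward = df_demo.ffill()

# Backward fill: propagate the next valid value upward
backward = df_demo.bfill()

# Without inplace, fillna returns a new object and leaves df_demo unchanged
filled = df_demo.fillna(0)
print(df_demo.isnull().sum().sum())   # original still has missing values

# With inplace=True, the object itself is modified
df_copy = df_demo.copy()
df_copy.fillna(0, inplace=True)
print(df_copy.isnull().sum().sum())
```

Note that a forward fill cannot fill a missing value in the first row, and a backward fill cannot fill one in the last row, because there is no neighboring value to copy from.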

## Working with dataset with missing values

Marketing dataset: This dataset contains questions from questionnaires that were filled out by shopping mall customers in the San Francisco Bay Area. The goal is to predict the annual household income from the other 13 demographic attributes. [Source](http://sci2s.ugr.es/keel/dataset.php?cod=163)

[Data Dictionary](http://sci2s.ugr.es/keel/dataset/data/classification/marketing-names.txt)

In [None]:
mark_data = pd.read_csv('./data/marketing.csv')

In [None]:
mark_data.head()

In [None]:
mark_data.sample(5)

### Activity:

* How many total respondents are in the dataset? 



* How many missing values are in the following two columns? 
  * Age column
  * MaritalStatus column


**NOTE**: In the above scenario, we are analyzing one column at a time. Below, we see how to work with all columns at the same time.

* How many missing values are there in each of the columns of the dataset? 


* What percentage of values is missing in each column of the dataset? 




* Which attribute has the most missing values in the dataset? (**Hint**: To get the index of the maximum element you can use [`idxmax()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.idxmax.html) function)




* How do you fill the missing values with a `0`? 



* **Most Common Use**: Can you fill each missing value with the corresponding average for that attribute? 
    * For example, if the 'Education' attribute is missing for a person, can you find the average 'Education' of all people and fill the missing 'Education' value with that average? 

# Combining Pandas Datasets with Concatenation [MORE INFO](https://pandas.pydata.org/pandas-docs/stable/merging.html)

## Introduction

In [None]:
# For this tutorial, we will need college_loan_defaults dataset.
college_loan_defaults = pd.read_csv(
    './data/college-loan-default-rates.csv', index_col='opeid')

# Keep in mind that the original dataset has this many rows
len(college_loan_defaults)

The Office of Postsecondary Education Identification (OPEID) code for each college is used as the index.

In [None]:
college_loan_defaults.head()

## `pd.concat`
You can think of the `pd.concat` function as the equivalent of the NumPy `concatenate` function for `Series` and `DataFrame` objects.

We will spend most of our time on how this function works with `DataFrame` objects as opposed to `Series` objects, since in practice that is how it is used most frequently.

When it comes to using the `pd.concat` function, the most basic question is whether you are adding *additional rows* or *additional columns*. We'll run through the function arguments based on concatenating rows and then come back for a look at how we perform column concatenations.
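The rows-versus-columns distinction can be sketched with a few tiny, made-up frames (`top`, `bottom`, and `extra` are hypothetical). The `axis` argument of `pd.concat` controls the direction: the default stacks rows, and `axis=1` places frames side by side, aligned on the index.

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration
top = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
bottom = pd.DataFrame({'x': [3]}, index=['c'])
extra = pd.DataFrame({'y': [10, 20]}, index=['a', 'b'])

# Default axis=0: add additional rows
more_rows = pd.concat([top, bottom])

# axis=1: add additional columns, matching rows by index label
more_cols = pd.concat([top, extra], axis=1)

print(more_rows.shape)
print(more_cols.shape)
```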

### Concatenating `DataFrame` Rows

In [None]:
# Here, I'll split college_loan_defaults into multiple
# sections of rows that we will then stitch back together.
part_1 = college_loan_defaults.iloc[:1000]
part_2 = college_loan_defaults.iloc[1000:2000]
part_3 = college_loan_defaults.iloc[1999:]


In [None]:
# This creates three parts:
# rows 0-999
# rows 1000-1999
# rows 1999-end -> notice that row 1999 appears twice
# Index.intersection returns the index labels shared by both parts
part_3.index.intersection(part_2.index)

#### Basic Usage

In [None]:
# Join all three parts together with pd.concat
concatenated_dataframe = pd.concat([part_3, part_1, part_2])
concatenated_dataframe.head()

In [None]:
print(concatenated_dataframe.shape[0])
print(part_1.shape[0], part_2.shape[0], part_3.shape[0])

<div class="alert alert-block alert-info">
Notice that `pd.concat` does not sort the elements of the DataFrame that it returns.
</div>
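A minimal sketch of this behavior on made-up data (the `frame`, `first`, and `second` names are hypothetical): the concatenated result keeps the input order and any repeated index labels, and `sort_index()` and `Index.duplicated()` are one possible way to restore order and spot the overlap afterwards.

```python
import pandas as pd

# Hypothetical toy data with an intentional overlap on index label 2
frame = pd.DataFrame({'v': [10, 20, 30, 40]}, index=[1, 2, 3, 4])
first = frame.iloc[:2]     # labels 1, 2
second = frame.iloc[1:]    # labels 2, 3, 4

# The result preserves the order in which the pieces were passed
joined = pd.concat([second, first])
print(joined.index.tolist())

# Sort by index label after concatenating
sorted_joined = joined.sort_index()
print(sorted_joined.index.tolist())

# Count repeated index labels, then keep only the first occurrence of each
n_duplicates = joined.index.duplicated().sum()
deduped = joined[~joined.index.duplicated()]
print(n_duplicates, len(deduped))
```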