# 7. More DataFrame Methods

### Objectives

+ Understand how to use the following methods to handle missing data: **`isna`**, **`notna`**, **`fillna`**, **`dropna`**
+ Sort values and the index with **`sort_values`** and **`sort_index`**
+ Find the index of the max and min with **`idxmax`** and **`idxmin`**
+ Rename columns with **`rename`**
+ Drop columns with **`drop`**
+ Explore uniqueness with **`nunique`** and **`drop_duplicates`**
+ Add new columns to a DataFrame
+ Operate with two Series at a time
+ Methods DataFrames do not have - **`str`**, **`dt`**, **`unique`**, **`value_counts`**
+ Use the function **`pd.to_numeric`** to coerce string columns to numeric data types

# Introduction
This notebook covers more of the common DataFrame methods. As before, these are fundamental methods that you will use in nearly every data analysis. Most of them are also available on Series objects.

In [None]:
import pandas as pd
pd.options.display.max_columns = 100

In [None]:
college = pd.read_csv('../data/college.csv', index_col='instnm')
college.head()

## Methods for handling missing values
Pandas provides the following methods to handle missing values:

* **`isna`** - Returns a boolean for each value indicating whether it is missing (a DataFrame of booleans when called on a DataFrame)
* **`notna`** - Exact opposite of **`isna`**
* **`fillna`** - Fills missing values in a variety of ways
* **`dropna`** - Drops missing values; for a DataFrame, this means dropping the rows (or columns) that contain them
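
As a minimal sketch of the four methods, the following uses a small made-up DataFrame (the name `toy` and its values are hypothetical and not part of the college dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the college dataset
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                    'b': ['x', 'y', None]})

missing = toy.isna()      # True wherever a value is missing
present = toy.notna()     # the exact opposite of isna
filled = toy.fillna({'a': 0.0, 'b': 'unknown'})  # one fill value per column
complete = toy.dropna()   # keeps only the rows with no missing values
```

Each call returns a new object, so `toy` itself is left unchanged.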

### Finding the number of missing values in each column
Pandas gives us a direct way to find the number of non-missing values in each column with the **`count`** method. To find the number of missing values, call the **`isna`** method to turn each value into a boolean and then sum each column.

In [None]:
# find the number of non-missing values per column
college.count().head()

In [None]:
college.isna().head()

In [None]:
college.isna().sum().head()

### Find the percentage of missing values by calling the mean method

In [None]:
college.isna().mean()
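
Summing and averaging booleans works because True counts as 1 and False as 0. A small sketch on made-up data (the `toy` DataFrame is hypothetical) shows how the three results relate:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a known number of missing values
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                    'b': [1.0, 2.0, 3.0, np.nan]})

n_present = toy.count()            # non-missing values per column: a=2, b=3
n_missing = toy.isna().sum()       # missing values per column: a=2, b=1
frac_missing = toy.isna().mean()   # fraction missing per column: a=0.5, b=0.25
```

For every column, `n_present + n_missing` equals the number of rows.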

**`fillna`** and **`dropna`** will be covered in a notebook specific to handling missing values.

## Sorting with `sort_values`
The **`sort_values`** DataFrame method sorts the DataFrame by one or more columns. Pass the **`by`** parameter a column name or list of column names to sort by. By default, sorting is in ascending order.

In [None]:
college.sort_values(by='city').head()

### Simultaneously sort two or more columns
Sort by any number of columns by passing a list of their names to the **`by`** parameter. The sort happens by going left to right through the columns. For instance, the following sorts all the colleges by state, and then within each state, sorts by undergraduate population.

In [None]:
state_ugds_sort = college.sort_values(by=['stabbr', 'ugds'])
state_ugds_sort[['stabbr', 'ugds']].head()

Let's select just the state of Oklahoma to verify that it too is sorted by undergraduate population.

In [None]:
filt = state_ugds_sort['stabbr'] == 'OK'
cols = ['stabbr', 'ugds']
state_ugds_sort.loc[filt, cols].head()

## Sort multiple columns in different directions
The **`ascending`** parameter may be passed a list of booleans that correspond to the list of column names in the **`by`** parameter. The following sorts by state and then by undergraduate population from greatest to least.

In [None]:
college.sort_values(by=cols, ascending=[True, False]).head()
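
The effect is easier to check by eye on a handful of rows. This sketch uses a made-up DataFrame (the values and the `s1`..`s4` labels are hypothetical) that borrows the column names from the college data:

```python
import pandas as pd

# Hypothetical toy data that reuses the college column names
toy = pd.DataFrame({'stabbr': ['TX', 'AL', 'TX', 'AL'],
                    'ugds': [500, 100, 900, 300]},
                   index=['s1', 's2', 's3', 's4'])

# State ascending (AL before TX), population descending within each state
result = toy.sort_values(by=['stabbr', 'ugds'], ascending=[True, False])
print(result)
```

The rows come back in the order `s4`, `s2`, `s3`, `s1`: both AL schools first, largest population first, followed by the TX schools in the same manner.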

## Sort by the Index
The DataFrame may be sorted by its index with the **`sort_index`** method.

In [None]:
college.sort_index(ascending=False).head()

## Finding the index of the maximum of each column with `idxmax`
The **`idxmax`** method is quite powerful and returns the index value for the maximum of each column. It does not work with string columns. The following selects just the numeric columns and then calls the **`idxmax`** method.

In [None]:
college.select_dtypes('number').idxmax()

### Interpretation of results
A Series is returned with the original column names as its index and, for each column, the index label of the row holding the maximum as its value. For instance, the school with the highest average SAT Math scores is California Institute of Technology. The school with the highest percentage of undergraduate population above 25 years of age is Dongguk University-Los Angeles.
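
To make the shape of the result concrete, here is a small sketch with made-up school names and scores (all hypothetical) that calls both **`idxmax`** and **`idxmin`**:

```python
import pandas as pd

# Hypothetical scores; the index plays the role of the school name
toy = pd.DataFrame({'math': [600, 780, 650],
                    'verbal': [700, 640, 690]},
                   index=['School A', 'School B', 'School C'])

top = toy.idxmax()      # math -> 'School B', verbal -> 'School A'
bottom = toy.idxmin()   # math -> 'School A', verbal -> 'School B'
```

The column names become the index of the result, and the values are row labels rather than the maximum or minimum numbers themselves.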

### Dropping Columns
The **`drop`** method drops columns passed to the **`columns`** parameter as either a string or a list of strings. Let's see examples of dropping a single column and then multiple columns:

In [None]:
college.drop(columns='city').head()

In [None]:
cols = ['city', 'stabbr', 'satvrmid']
college.drop(columns=cols).head()

### Renaming Columns
The **`rename`** method renames columns. Pass a dictionary to the **`columns`** parameter with keys equal to the old column names and values equal to the new column names.

The college dataset has lots of columns with abbreviations that are not immediately recognized. We replace a couple of these columns with more explicit names.

In [None]:
college.rename(columns={'hbcu': 'Historically Black Colleges and Universities',
                       'stabbr': 'State Abbreviation'}).head()
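
By default, both **`drop`** and **`rename`** return a new DataFrame and leave the original untouched. A minimal sketch on a one-row, made-up DataFrame (the values are hypothetical) shows this:

```python
import pandas as pd

# Hypothetical one-row DataFrame that reuses a few college column names
toy = pd.DataFrame({'city': ['Austin'], 'stabbr': ['TX'], 'hbcu': [0]})

trimmed = toy.drop(columns='city')
renamed = toy.rename(columns={'stabbr': 'State Abbreviation'})

print(list(toy.columns))      # the original columns are all still there
print(list(trimmed.columns))
print(list(renamed.columns))
```

To keep a change, assign the result to a variable, as done here with `trimmed` and `renamed`.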

### The `nunique` method
The **`nunique`** method returns the number of unique values for each column. It defaults to ignoring missing values. It technically is an aggregation method as it returns a single value for each column.

In [None]:
college.nunique()
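
The treatment of missing values is controlled by the **`dropna`** parameter. The following sketch, on a made-up DataFrame, compares the default with `dropna=False`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column 'a' has one distinct number plus a missing value
toy = pd.DataFrame({'a': [1.0, 1.0, np.nan],
                    'b': ['x', 'y', 'x']})

unique_default = toy.nunique()               # a=1, b=2 (missing value ignored)
with_missing = toy.nunique(dropna=False)     # a=2, b=2 (missing value counted)
```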

### Adding a new column to the DataFrame
A new column may be added to a DataFrame using similar syntax to selecting a single column with the brackets. The general syntax will look like the following:

```
>>> df['new_column'] = <some expression>
```

Let's add the two SAT columns together.

In [None]:
college['sat_total'] = college['satmtmid'] + college['satvrmid']
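
One detail worth keeping in mind when operating on two Series: the arithmetic is done row by row, and a missing value in either operand produces a missing result. A small sketch with hypothetical scores:

```python
import numpy as np
import pandas as pd

# Hypothetical scores; the second school has no math score
toy = pd.DataFrame({'math': [600.0, np.nan],
                    'verbal': [700.0, 650.0]})

# Bracket assignment appends the new column at the end
toy['total'] = toy['math'] + toy['verbal']
print(toy)
```

The first row gets a total of 1300.0, while the second row's total is `NaN` because its math score is missing.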

## Important methods Available only to Series and not DataFrames
There are more than a few methods that are available only to Series objects, but the following are the most important.

### No `str` or `dt` accessor
DataFrames have no special methods just for strings or datetimes. There is no **`str`** or **`dt`** accessor. You can only use these accessors on Series objects. This usually means that you will select a column as a Series first.

### No `unique` or `value_counts`
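The **`unique`** method is available only to Series. In older versions of pandas the same was true of **`value_counts`**; since version 1.1, DataFrames also have a **`value_counts`** method, but it counts unique rows rather than the values in a single column, so you will still usually select a column as a Series first.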
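
In practice this means selecting a single column first and then calling the Series method on it. A minimal sketch on a made-up column of city names (hypothetical data):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'city': ['austin', 'dallas', 'austin']})

city = toy['city']              # select one column as a Series
upper = city.str.upper()        # string methods live under the str accessor
distinct = city.unique()        # array of the distinct values
counts = city.value_counts()    # austin appears twice, dallas once
```

Attempting `toy.str` on the DataFrame itself raises an `AttributeError`.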

In [None]:
pd.read_csv('../data/college_data_dictionary.csv')

# Exercises
Use the college dataset for the first few problems. You may want/need to read the 'Extra' section below first.

### Problem 1
<span  style="color:green; font-size:16px">Find the number of missing values for each row.</span>

In [None]:
# your code here

### Problem 2
<span  style="color:green; font-size:16px">What percentage of rows have more than 5 missing values?</span>

In [None]:
# your code here

### Problem 3
<span  style="color:green; font-size:16px">How many total missing values are there in the entire DataFrame?</span>

In [None]:
# your code here

### Problem 4
<span  style="color:green; font-size:16px">Several of the columns contain binary data (are either 0 or 1). Can you identify the names of these columns?</span>

In [None]:
# your code here

### Problem 5
<span  style="color:green; font-size:16px">Read the documentation on the `dropna` method. What is the shape of the returned DataFrame when called with the defaults? Call it again except only drop rows if `ugds` is missing. What is the shape of this DataFrame?</span>

In [None]:
# your code here

### Problem 6
<span  style="color:green; font-size:16px">Create a new boolean column in the college DataFrame named 'Verbal Higher' that is True for every college that has a higher verbal than math SAT score. Find the mean of this new column. Why does this number look suspiciously low?</span>

In [None]:
# your code here

### Problem 7
<span  style="color:green; font-size:16px">Find the real percentage of schools with higher verbal than math SAT scores.</span>

In [None]:
# your code here

## Use the flights dataset with the remainder of the problems

### Problem 8
<span  style="color:green; font-size:16px">Read in the flights dataset (`flights.csv`) and call `dropna` with the defaults. What kind of DataFrame was returned? Why? Verify that each row has at least one missing value.
</span>

In [None]:
# your code here

### Problem 9
<span  style="color:green; font-size:16px">Read the `dropna` docs again and keep rows that have at least 28 non-missing values. Verify the results.</span>

In [None]:
# your code here

### Problem 10
<span  style="color:green; font-size:16px">Find the longest `arrival_delay` for every airline for every month.</span>

In [None]:
# your code here

# Extra

More important material on DataFrame methods.

## Dropping duplicate rows with `drop_duplicates`
The default call to the **`drop_duplicates`** method returns only unique rows of the DataFrame. It does not use the index value in its search for duplicates. If two or more rows are duplicated, the first row is kept.

Let's see if there are any duplicate rows in the college dataset.

In [None]:
college.shape

In [None]:
college.drop_duplicates().shape

Interestingly, there are some rows that have an exact duplicate.

### Drop duplicates based on a subset of columns
The **`drop_duplicates`** method gives you the **`subset`** parameter to check for duplicates within the columns passed to it. The following returns a single row for each state.

In [None]:
college.drop_duplicates(subset='stabbr').head()

In [None]:
college.drop_duplicates(subset='stabbr').shape

The data includes several territories and not just the 50 states.

## Case Study: Selecting the school with the maximum SAT scores for each state
Let's say we are interested in finding the school with the maximum total SAT score for each state. The dataset gives us the math and verbal SAT scores.

First ensure that we have a column for the total SAT score.

In [None]:
college['sat_total'] = college['satmtmid'] + college['satvrmid']

By default, this column gets added to the end. Let's select just a subset of the college DataFrame so it is easier to view.

In [None]:
cols = ['stabbr', 'ugds', 'satmtmid', 'satvrmid', 'sat_total']
total_sat = college[cols]
total_sat.head()

#### Sort data first
If we sort by state and total SAT score we will get close to finding the schools with the best scores for each state.

In [None]:
total_sat.sort_values(['stabbr', 'sat_total'], ascending=[True, False]).head(20)

### Use `drop_duplicates` to finish the problem
We would like to keep only the first row for every state after the data has been sorted. The **`drop_duplicates`** method does this for us because it keeps the first occurrence of each duplicated value.

In [None]:
total_sat.sort_values(['stabbr', 'sat_total'], ascending=[True, False]) \
         .drop_duplicates(subset='stabbr').head(10)
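
The sort-then-drop pattern can be verified on a few rows. This sketch uses made-up schools `A` through `D` (hypothetical labels and scores) with the same column names as above:

```python
import pandas as pd

# Hypothetical toy data: two schools in each of two states
toy = pd.DataFrame({'stabbr': ['TX', 'AL', 'TX', 'AL'],
                    'sat_total': [1200, 1100, 1400, 1000]},
                   index=['A', 'B', 'C', 'D'])

# Sort so that the best school in each state comes first within its state,
# then keep only the first row seen for each state
best = (toy.sort_values(['stabbr', 'sat_total'], ascending=[True, False])
           .drop_duplicates(subset='stabbr'))
print(best)
```

Only school `B` (AL, 1100) and school `C` (TX, 1400) remain, one row per state.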

## Case Study 2: Hidden String Columns
When reading in data, a common misfortune is for a column that you believe to be numeric to be read in as strings. This happens when a column contains a mix of numeric and string data. By default, pandas leaves such columns as strings. This happened with both the `MD_EARN_WNE_P10` (median earnings after 10 years of enrollment) and `GRAD_DEBT_MDN_SUPP` (median debt of completers) columns.

Visually inspecting the first few rows makes it appear that these columns are numeric.

In [None]:
cols = ['md_earn_wne_p10', 'grad_debt_mdn_supp']
college[cols].head()

But looking at their data types reveals that they are not.

In [None]:
college[cols].dtypes

### Output values as NumPy array
To see that they are strings, you can output their values as a NumPy array.

In [None]:
college['md_earn_wne_p10'].values[:10]

## Boolean indexing to discover alphabetic strings
It's very likely that alphabetic string data appears in these columns. One way to reveal it is to use the string method **`isalpha`**, which returns True only if every character in the string is alphabetic. This boolean Series can be used to filter for just the alphabetic values.

There is one small issue: this Series also contains **`NaN`** values, and the **`isalpha`** method returns **`NaN`** rather than False for them. For boolean indexing to work, we need all values of the filtering Series to be boolean. We use **`fillna`** to fill these values with False so that our boolean indexing works.

In [None]:
earnings = college['md_earn_wne_p10']
filt = earnings.str.isalpha().fillna(False)
earnings[filt].head()
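
The same filter can be traced on a tiny made-up Series (hypothetical values). The `astype(bool)` call at the end is an extra precaution added in this sketch so that the filter is boolean regardless of pandas version; it is not part of the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical values mimicking a numeric-looking column read in as strings
s = pd.Series(['42500', 'PrivacySuppressed', np.nan, 'abc123'])

# isalpha is True only when every character is alphabetic,
# so 'abc123' does not qualify
filt = s.str.isalpha().fillna(False).astype(bool)

print(s[filt])
```

Only `'PrivacySuppressed'` passes the filter.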

## Count all the alphabetic strings
Let's use the **`value_counts`** method to see the count of all the different alphabetic strings found. The only alphabetic string is 'PrivacySuppressed'.

In [None]:
earnings[filt].value_counts()

### Coercing to missing values
We can still convert this column to a numeric type by converting all the non-numeric strings to missing values. We have to use the pandas **function** **`to_numeric`**, which is accessed directly from **`pd`**. We must call it with **`errors`** set to the string **`'coerce'`**. This forces any non-numeric data to **`NaN`**.

In [None]:
earnings_numeric = pd.to_numeric(earnings, errors='coerce')
earnings_numeric.head()

Notice the data type is now float.

In [None]:
earnings_numeric.dtype
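
The coercion can also be seen in isolation on a short made-up Series (hypothetical values):

```python
import pandas as pd

# Hypothetical values: two numeric strings and one non-numeric string
s = pd.Series(['42500', 'PrivacySuppressed', '31000'])

nums = pd.to_numeric(s, errors='coerce')
print(nums)
print(nums.dtype)
```

The two numeric strings become 42500.0 and 31000.0, the non-numeric string becomes `NaN`, and the resulting data type is float64 because `NaN` is a floating-point value.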

## Replace the columns
Our DataFrame still has not been altered. Let's re-assign both columns to their new values:

In [None]:
college['md_earn_wne_p10'] = pd.to_numeric(college['md_earn_wne_p10'], errors='coerce')
college['grad_debt_mdn_supp'] = pd.to_numeric(college['grad_debt_mdn_supp'], errors='coerce')

### Verify that the data types have been changed

In [None]:
college.dtypes