# More DataFrame Methods

## Overview 

This chapter covers more of the common DataFrame methods. As before, these are fundamental methods that you will rely on in most data analyses. All of these methods are also found in Series objects.

### Objectives

* Handling missing data: `isna`, `notna`, `fillna`, `dropna`
* Sorting the values and the index with `sort_values` and `sort_index`
* Finding the index of the max and min with `idxmax` and `idxmin`
* Renaming column names with `rename`
* Dropping columns with `drop`
* Adding new columns to a DataFrame
* Operating with two Series at a time
* Series Methods not found in DataFrames - `str`, `dt`, `unique`, `value_counts`

In [None]:
import pandas as pd
pd.set_option('display.max_columns', 40)
movie = pd.read_csv('../data/movie.csv', index_col='title')
movie.head()

## Methods for handling missing values
pandas provides the following DataFrame methods to handle missing values:

* `isna` - Returns a DataFrame of booleans indicating whether each value is missing or not
* `notna` - Exact opposite of `isna`
* `fillna` - Fills missing values in a variety of ways
* `dropna` - Drops missing values

### Finding the number of missing values in each column
pandas gives us a direct way to find the number of non-missing values in each column with the `count` method. To find the number of missing values we will need to call the `isna` method to turn each value into a boolean and then sum each column.

In [None]:
# find the number of non-missing values per column
movie.count().head()

In [None]:
movie.isna().head()

Now, we can chain the `sum` method to count the number of missing values in each column.

In [None]:
movie.isna().sum().head()

### Find the percentage of missing values by calling the mean method
As we have seen before, taking the mean of a boolean Series returns the fraction of values that are `True`, which in this case is the fraction of missing values in each column. Multiply the result by 100 to express it as a percentage.

In [None]:
movie.isna().mean().head()

## The `fillna` method
The `fillna` method fills the missing values in your DataFrame in a couple of different ways.

### Filling the missing values with a given constant
The most basic way to use the `fillna` method is to pass it a single value. Doing so will replace every missing value with this constant. The following replaces all missing values with the string 'FILLED'. Note that this replaces every missing value representation: `NaN`, `NaT`, and `None`. In the output below, the fifth row is the only one with visible missing values.

In [None]:
movie.fillna('FILLED').head()

Filling a DataFrame's missing values with a single constant is unlikely to be what you need in a real situation. The `movie` DataFrame has columns of different data types, and by using the string `FILLED` we change the data type of any numeric column that had missing values to object. For instance, the column `duration` was originally a float and is now an object.

### Using a Dictionary to fill specific columns

A more practical application would be to fill each column with a different constant value. We can use `fillna` to do this by passing it a dictionary that maps each column name to its missing value replacement. The following fills the `content_rating` column with 'PG' and the `duration` column with 199.

In [None]:
movie.fillna({'content_rating': 'PG', 'duration': 199}).head()

### Fill all columns with the mean
A somewhat common approach to filling missing values is to use the mean of the column as the replacement. Taking the `mean` of a DataFrame will return a Series that has each column name labeling its mean. A Series is also a valid object that can be passed to the `fillna` method. Let's begin by finding the mean of each column. We round the mean to the nearest whole number as most of the columns use whole numbers.

In [None]:
mm = movie.mean(numeric_only=True).round(0)
mm.head()

Pass the above Series to `fillna` to use a different replacement value for each column. Note that the string columns will not be filled as they do not have a mean.

In [None]:
movie.fillna(mm).head()

Filling missing values with the mean isn't necessarily a good strategy when doing data analysis. The example above is only a demonstration of how the `fillna` method works.

### Filling missing values with the preceding or following values
Instead of filling missing values with a constant, you can fill in missing values with the immediately preceding or following known value. Let's use a simple CSV to clearly see how this is done.

In [None]:
df = pd.read_csv('../data/missing_example.csv')
df

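The `method` parameter controls whether you fill in the missing value with the immediately preceding or following value. Set it equal to the string 'ffill' to fill missing values going 'forward', i.e., using the immediately preceding value. Note that pandas 2.1 deprecated the `method` parameter of `fillna` in favor of the dedicated `ffill` and `bfill` methods, so newer versions may emit a warning or reject the calls in this section.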

In [None]:
df.fillna(method='ffill')

Notice that the first value for the `orders` column is still missing as there is no immediately preceding value to fill it with. Use the string 'bfill' to fill missing values going backwards.

In [None]:
df.fillna(method='bfill')

You can limit the number of consecutive values filled with the `limit` parameter. Here we limit the number filled to only 1.

In [None]:
df.fillna(method='ffill', limit=1)

When you use `fillna`, you must choose between filling with given values, as in the earlier examples, and forward or backward filling with the `method` parameter, as in the examples just above. You cannot do both in the same call.
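
The forward and backward fills shown above can also be written with the dedicated `ffill` and `bfill` methods, which newer pandas versions prefer. The following is a minimal sketch; the `demo` DataFrame and its values are made up for illustration and are not the contents of `missing_example.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical data used only for this illustration
demo = pd.DataFrame({'orders': [np.nan, 5.0, np.nan, np.nan, 8.0],
                     'price': [1.5, np.nan, np.nan, 3.0, np.nan]})

# Fill each missing value with the immediately preceding known value
forward = demo.ffill()

# Fill each missing value with the immediately following known value
backward = demo.bfill()

# Fill at most one consecutive missing value going forward
limited = demo.ffill(limit=1)
```

As with `fillna`, the first `orders` value stays missing after a forward fill because nothing precedes it.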

### The `dropna` method

The `dropna` method is used to drop entire rows or columns that contain missing values. Calling it with the default parameters will have it drop all **rows** that have one or more missing values in them. More than 1,000 rows are dropped after the operation.

In [None]:
movie.shape

In [None]:
movie.dropna().shape

The first parameter is `axis`, which defaults to 0 (or 'index'). Changing it to 1 (or 'columns') drops any columns that have one or more missing values. Only four columns have no missing values.

In [None]:
movie.dropna(axis='columns').shape

### Drop rows where only a particular column is missing
By default, the `dropna` method will drop any row that has one or more missing values anywhere in it. In the `movie` DataFrame, there are 22 columns. If any one of these 22 columns has a missing value for a particular row, then that row will be dropped.

pandas gives us the option of only dropping rows where a particular column or columns have missing values. Instead of looking at all 22 columns for missing values, we can restrict the columns with the `subset` parameter by passing it a list of strings. Below, we only drop rows that are missing the `content_rating`. About 300 movies have no `content_rating`.

In [None]:
movie.dropna(subset=['content_rating']).shape

Make sure to explore the `how` and `thresh` parameters for even more options with `dropna`.
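
As a starting point for that exploration, here is a small sketch of `how` and `thresh` on a made-up DataFrame (the `demo` data is hypothetical and chosen so each call keeps a different number of rows).

```python
import numpy as np
import pandas as pd

# Hypothetical data: rows have 3, 2, 0, 2, and 1 non-missing values
demo = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0, 5.0],
                     'b': [1.0, 2.0, np.nan, np.nan, np.nan],
                     'c': [1.0, 2.0, np.nan, 4.0, np.nan]})

# Default (how='any'): drop rows with one or more missing values
any_missing = demo.dropna()

# how='all': drop only rows where every value is missing
all_missing = demo.dropna(how='all')

# thresh=2: keep only rows with at least 2 non-missing values
at_least_two = demo.dropna(thresh=2)
```

Comparing the `shape` of each result shows how much stricter the default is than the other two options.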

## Sorting

pandas allows you to sort either by the values of your DataFrame or by the index, which is why it has two separate methods, `sort_values` and `sort_index` instead of just a single method. Thinking in terms of these two separate components can make it easier to remember that they are distinct and different methods.

### The `sort_values` method
The `sort_values` DataFrame method sorts the DataFrame by the values in one or more columns. Pass the `by` parameter a column name or list of column names to sort by. By default, the sort is in ascending order. The college dataset is used for the remainder of the examples in this chapter.

In [None]:
college = pd.read_csv('../data/college.csv', index_col='instnm')
college.head()

In [None]:
college.sort_values(by='city').head()

### Simultaneously sort two or more columns
Sort by any number of columns by passing a list of their names to the `by` parameter. The sort begins with the first column. For instance, the following sorts all the colleges by state, and then within each state, sorts by undergraduate population.

In [None]:
state_ugds_sort = college.sort_values(by=['stabbr', 'ugds'])
state_ugds_sort[['stabbr', 'ugds']].head()

Let's select just the state of Oklahoma to verify that it too is sorted by undergraduate population.

In [None]:
filt = state_ugds_sort['stabbr'] == 'OK'
cols = ['stabbr', 'ugds']
state_ugds_sort.loc[filt, cols].head()

### Sort multiple columns in different directions
The `ascending` parameter may be passed a list of booleans that correspond to the list of column names in the `by` parameter. The following sorts by state and then by undergraduate population from greatest to least.

In [None]:
college.sort_values(by=cols, ascending=[True, False]).head()

### Sort by the Index
The DataFrame may be sorted by its index with the `sort_index` method.

In [None]:
college.sort_index(ascending=False).head()

### Sort the columns
Interestingly, you can use the same `sort_index` method to sort the columns of the DataFrame. Remember that pandas uses an Index object to hold the column names. To sort the columns, set the `axis` parameter to 'columns' or 1. This is identical to how we changed the direction of the operation of the statistical methods in the previous chapter. Perhaps this method would have been more appropriately named `sort_axis` since it sorts either axis.

In [None]:
college.sort_index(axis='columns').head(2)

## Finding the index of the maximum of each column with `idxmax`
The `idxmax` method is quite powerful and returns the index of the maximum value of each column. It does not work with string columns. The following selects just the numeric columns and then calls the `idxmax` method.

In [None]:
college.select_dtypes('number').idxmax()

### Interpretation of results
A Series is returned with the original column names in its index and, as its values, the index labels of each column's maximum. For instance, the school with the highest average SAT Math scores is California Institute of Technology. The school with the highest percentage of undergraduate population above 25 years of age is Dongguk University-Los Angeles. If there are ties, pandas returns the first index label at which the maximum occurs.

The analogous `idxmin` method returns the index where the minimum value exists for every column.
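
The behavior of both methods, including how ties are handled, can be seen on a tiny example. This is only a sketch; the `scores` DataFrame and school names are invented for illustration.

```python
import pandas as pd

# Hypothetical scores; 'School B' and 'School C' tie for the top math score
scores = pd.DataFrame({'math': [700, 780, 780],
                       'verbal': [720, 690, 650]},
                      index=['School A', 'School B', 'School C'])

# Index label of the maximum of each column (first label wins on ties)
highest = scores.idxmax()

# Index label of the minimum of each column
lowest = scores.idxmin()
```

Here `highest['math']` is 'School B' rather than 'School C' because 'School B' appears first in the index.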

## Column and Row Dropping and Renaming

### Dropping Columns
The `drop` method drops columns passed to the `columns` parameter as either a string or a list of strings. Let's see examples of dropping a single column and then multiple columns. Remember that methods return completely new objects so the original DataFrame is not affected. You'll need to assign the result of the operation to a new variable if you'd like to proceed with the slimmer DataFrame.

In [None]:
college.drop(columns='city').head()

In [None]:
cols = ['city', 'stabbr', 'satvrmid']
college.drop(columns=cols).head()
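
To make the point about new objects concrete, the sketch below drops columns from a small made-up DataFrame and assigns the result to a new variable. The `demo` data and the name `slim` are hypothetical.

```python
import pandas as pd

# Hypothetical data used only for this illustration
demo = pd.DataFrame({'city': ['Austin', 'Boston'],
                     'stabbr': ['TX', 'MA'],
                     'ugds': [100, 200]})

# drop returns a new DataFrame; assign it to keep the slimmer version
slim = demo.drop(columns=['city', 'stabbr'])

# demo still has all three columns, while slim has only 'ugds'
```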

You can also drop rows by **label** and not integer location with the `drop` method. Again, you may use a single label or a list of labels.

In [None]:
rows = ['University of Alabama at Birmingham', 'Amridge University']
college.drop(index=rows).head()

### Renaming Columns
The `rename` method is used to rename columns. Pass a dictionary to the `columns` parameter with keys equal to the old column name and values equal to the new column name. The college dataset has lots of columns with abbreviations that are not immediately recognized. Below, we replace a couple of these columns with more explicit names.

In [None]:
college.rename(columns={'stabbr': 'state_abbreviation',
                        'relaffil': 'religious_affiliation'}).head()

### Renaming all columns at once
Instead of using the `rename` method to rename individual columns, you can assign the `columns` attribute a list of the new column names. The length of the list must be the same as the number of columns. To create an example, we will select the first five columns along with the first two rows of the `college` DataFrame.

In [None]:
college_small = college.iloc[:2, :5]
college_small

We can now overwrite all of the old column names by assigning a list of new column names to the `columns` attribute.

In [None]:
college_small.columns = ['CITY', 'STATE', 'HBCU', 'MEN_ONLY', 'WOMEN_ONLY']
college_small

## Adding a new column to the DataFrame
A new column may be added to a DataFrame using similar syntax as selecting a single column with the brackets. The general syntax will look like the following:

```
>>> df['new_column'] = <some expression>
```

Let's add the two SAT columns together and assign the result as a new column. The new column will always be appended to the end. The last five columns are displayed to show the new column.

In [None]:
college['sat_total'] = college['satmtmid'] + college['satvrmid']
college.iloc[:5, -5:]

### Setting a column equal to a scalar value
You can create a new column by assigning it to be a single scalar value. For instance, the following assignment creates a new column of values equal to the number -99.

In [None]:
college['some num'] = -99
college.iloc[:5, -5:]

### Overwriting an existing column
You can replace the contents of a column by assigning new values to an existing column name. Below, we increase the undergraduate population of each college by 10%.

In [None]:
college['ugds'].head()

In [None]:
college['ugds'] = college['ugds'] * 1.1
college.head()

### Create a new column from a numpy array
You can create a new column by assigning it a NumPy array (or another Python sequence) that has the same length as the DataFrame. Below, we create a column of random values drawn from a standard normal distribution.

In [None]:
import numpy as np
college['random normal'] = np.random.randn(len(college))
college.iloc[:5, -5:]

## Methods Available only to Series and not DataFrames
There are more than a few methods that are available only to Series objects, but the following are the most important.

### No `str` or `dt` accessor
DataFrames have no special methods just for strings or datetimes. There is no `str` or `dt` accessor. You can only use these accessors on Series objects. This usually means that you will select a column as a Series first.

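### `unique` and `value_counts`
The `unique` method is only available to Series as well. The same was true of `value_counts` until pandas 1.1, which added a DataFrame version that counts unique rows rather than the values of a single column.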
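
The following sketch shows the usual pattern of selecting a column as a Series first and then using these Series-only tools. The `demo` DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data with one string column and one datetime column
demo = pd.DataFrame({
    'city': ['Austin', 'Boston', 'Austin'],
    'opened': pd.to_datetime(['2001-05-17', '1999-12-31', '2010-01-01']),
})

# Select a column as a Series, then use the str accessor
upper_city = demo['city'].str.upper()

# The dt accessor works on datetime Series
open_year = demo['opened'].dt.year

# unique returns the distinct values in order of first appearance
distinct = demo['city'].unique()

# value_counts returns how often each value occurs
counts = demo['city'].value_counts()
```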

## Exercises
Use the college dataset for the first few problems. Uncomment the next line to read in the data dictionary.

In [None]:
# pd.read_csv('../data/college_data_dictionary.csv')

### Exercise 1
<span  style="color:green; font-size:16px">Find the number of missing values for each row.</span>

### Exercise 2
<span  style="color:green; font-size:16px">What percentage of rows have more than 5 missing values?</span>

### Exercise 3
<span  style="color:green; font-size:16px">How many total missing values are there in the entire DataFrame?</span>

### Exercise 4
<span  style="color:green; font-size:16px">Several of the columns contain binary data (are either 0 or 1). Can you identify the names of these columns?</span>

### Exercise 5
<span  style="color:green; font-size:16px">Read the documentation on the `dropna` method. What is the shape of the returned DataFrame when called with the defaults? Call it again except only drop rows if `ugds` is missing. What is the shape of this DataFrame?</span>

### Exercise 6
<span  style="color:green; font-size:16px">Create a new boolean column in the `college` DataFrame named 'Verbal Higher' that is True for every college that has a higher verbal than math SAT score. Find the mean of this new column. Why does this number look suspiciously low?</span>

### Exercise 7
<span  style="color:green; font-size:16px">Find the real percentage of schools with higher verbal than math SAT scores.</span>

### Exercise 8
<span  style="color:green; font-size:16px">Use the `copy` method to create a new copy of the `college` DataFrame and assign it to variable `college2`. Select all the non-white race columns (`ugds_black` through `ugds_unkn`).  Sum the rows of this DataFrame and assign the result to a variable. Now drop all the non-white race columns from the `college2` DataFrame and assign the result to `college3`. </span>
    
<span  style="color:green; font-size:16px">Use the `insert` method to insert a new column to the right of the `ugds_white` column of the `college3` DataFrame. Name this column `ugds_nonwhite`.</span>

## Use the flights dataset for the remainder of the problems

### Exercise 9
<span  style="color:green; font-size:16px">Read in the flights dataset (`flights.csv`) and call `dropna` with the defaults. What kind of DataFrame was returned? Why? Verify that each row has at least one missing value.
</span>

### Exercise 10
<span  style="color:green; font-size:16px">Read the `dropna` docs again and keep rows that have at least 28 non-missing values. Verify the results.</span>

### Exercise 11
<span  style="color:green; font-size:16px">Find the longest `arrival_delay` for every airline for every month.</span>