## Addressing missing data

#### Dealing with missing data
It is important to deal with missing data before starting your analysis.

One approach is to drop observations with missing values when they account for a small proportion of your data, typically five percent or less.

Working with a dataset on plane ticket prices, stored as a pandas DataFrame called planes, you'll need to count the number of missing values across all columns, calculate five percent of all values, use this threshold to remove observations, and check how many missing values remain in the dataset.
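
The steps above can be sketched as follows. This is a minimal illustration on a made-up stand-in for planes (the real dataset is not reproduced here), and the names threshold and cols_to_drop are assumptions rather than anything confirmed by the exercise.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the planes DataFrame (values are invented for illustration).
planes = pd.DataFrame({
    "Airline": ["Jet Airways"] * 39 + [np.nan],
    "Price": [5000.0] * 30 + [np.nan] * 10,
})

# Count the missing values in each column.
print(planes.isna().sum())

# Five percent of the number of observations.
threshold = len(planes) * 0.05

# Columns whose missing-value count is at or below the threshold.
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]

# Drop the rows that have a missing value in any of those columns.
planes = planes.dropna(subset=cols_to_drop)

# Check how many missing values remain.
print(planes.isna().sum())
```

Columns with more missing values than the threshold (here "Price") are left alone, so they need a different strategy.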

![dealing%20with%20missing%20data.png](attachment:dealing%20with%20missing%20data.png)

![dealing%20with%20missing%20data%202.png](attachment:dealing%20with%20missing%20data%202.png)

Awesome! By creating a missing values threshold and using it to filter columns, you've managed to remove missing values from all columns except for Additional_Info and Price.

#### Strategies for remaining missing data
The five percent rule has worked nicely for your planes dataset, eliminating missing values from nine out of 11 columns!

Now, you need to decide what to do with the "Additional_Info" and "Price" columns, which are missing 300 and 368 values respectively.

You'll first take a look at what "Additional_Info" contains, then visualize the price of plane tickets by different airlines.

The following imports have been made for you:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
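
One way to look at the two remaining columns is sketched below on invented data. Only pandas is used so the snippet runs on its own; the seaborn call for the box plot is shown as a comment. The variable names info_counts and median_by_airline are assumptions.

```python
import pandas as pd

# Toy stand-in for planes (values are invented for illustration).
planes = pd.DataFrame({
    "Airline": ["IndiGo", "IndiGo", "Jet Airways", "Jet Airways", "Air India"],
    "Additional_Info": ["No info", "No info", "In-flight meal not included", None, "No info"],
    "Price": [4000.0, 4200.0, 11000.0, 12500.0, None],
})

# See which values "Additional_Info" holds and how often each one occurs.
info_counts = planes["Additional_Info"].value_counts()
print(info_counts)

# Median price per airline: the center line of each box in a box plot.
median_by_airline = planes.groupby("Airline")["Price"].median()
print(median_by_airline)

# With seaborn and matplotlib imported, the plot itself could be drawn with:
# sns.boxplot(data=planes, x="Airline", y="Price")
# plt.show()
```

If one value dominates "Additional_Info", the column adds little information, and if prices differ clearly between airlines, a per-airline median is a more faithful fill value than a single overall median.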

![strategies%20for%20missing%20data.png](attachment:strategies%20for%20missing%20data.png)

![strategies%20for%20missing%20data%202.png](attachment:strategies%20for%20missing%20data%202.png)

Excellent! You don't need the "Additional_Info" column, and should impute median "Price" by "Airline" to accurately represent the data!

#### Imputing missing plane prices
Now there's just one column with missing values left!

You've removed the "Additional_Info" column from planes. The last step is to impute the missing data in the "Price" column of the dataset.

As a reminder, you generated this boxplot, which suggested that imputing the median price based on the "Airline" is a solid approach!
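
A minimal sketch of this imputation, on invented data, might look like the following. The names airline_prices and prices_dict are assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for planes (values are invented for illustration).
planes = pd.DataFrame({
    "Airline": ["IndiGo", "IndiGo", "IndiGo", "Jet Airways", "Jet Airways", "Jet Airways"],
    "Price": [4000.0, 4200.0, np.nan, 11000.0, 12500.0, np.nan],
})

# Median price for each airline (missing prices are ignored).
airline_prices = planes.groupby("Airline")["Price"].median()

# Convert the grouped result to a dictionary: {airline: median price}.
prices_dict = airline_prices.to_dict()

# Map each row's airline to its median, and use that only where "Price" is missing.
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))

print(planes.isna().sum())
```

Because fillna only touches missing entries, the prices that were already present are unchanged.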

![Imputing%20missing%20plane%20prices.png](attachment:Imputing%20missing%20plane%20prices.png)

Impressive imputing! You converted a grouped DataFrame to a dictionary and then used it to conditionally fill missing values for "Price" based on the "Airline"! Now let's explore how to perform exploratory analysis on categorical data.

## Converting and analyzing categorical data

#### Finding the number of unique values
You would like to practice some of the categorical data manipulation and analysis skills that you've just seen. To help identify which data could be reformatted to extract value, you are going to find out which non-numeric columns in the planes dataset have a large number of unique values.

pandas has been imported for you as pd, and the dataset has been stored as planes.
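
As a rough sketch on invented data, the non-numeric columns can be selected and looped over like this. The names non_numeric and unique_counts are assumptions.

```python
import pandas as pd

# Toy stand-in for planes (values are invented for illustration).
planes = pd.DataFrame({
    "Airline": ["IndiGo", "Jet Airways", "IndiGo", "Air India"],
    "Duration": ["19h", "5h 25m", "4h 45m", "2h 25m"],
    "Price": [4000.0, 11000.0, 4200.0, 9000.0],
})

# Keep only the columns that are not numeric.
non_numeric = planes.select_dtypes(exclude="number")

# Count the unique values in each of those columns.
unique_counts = {}
for col in non_numeric.columns:
    unique_counts[col] = non_numeric[col].nunique()
    print(f"Number of unique values in {col} column: {unique_counts[col]}")
```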

![Finding%20the%20number%20of%20unique%20values.png](attachment:Finding%20the%20number%20of%20unique%20values.png)

Great looping! Interestingly, "Duration" is currently an object column whereas it should be a numeric column, and has 362 unique values! Let's find out more about this column.

#### Flight duration categories
As you saw, there are 362 unique values in the "Duration" column of planes. Calling planes["Duration"].head(), we see the following values:

    0        19h
    1     5h 25m
    2     4h 45m
    3     2h 25m
    4    15h 30m
    Name: Duration, dtype: object

Looks like this won't be simple to convert to numbers. However, you could categorize flights by duration and examine the frequency of different flight lengths!

You'll create a "Duration_Category" column in the planes DataFrame. Before you can do this, you'll need to create a list of the category labels you would like to insert into the DataFrame, followed by the patterns in the existing "Duration" values that each label should be created from.
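
The setup could look something like the sketch below. The category labels and the hour cut-offs (under 5 hours, 5 to 9 hours, 10 to 16 hours) are assumptions made for illustration, since the exact values are only visible in the screenshot. Each pattern is a regular expression anchored to the start of the string so that, for example, "^1h" does not match "15h 30m".

```python
import re

# Labels to insert into the new column (assumed wording).
flight_categories = ["Short-haul", "Medium", "Long-haul"]

# Patterns describing which existing "Duration" strings belong to each label
# (assumed cut-offs: under 5h, 5h to 9h, 10h to 16h).
short_flights = "^0h|^1h|^2h|^3h|^4h"
medium_flights = "^5h|^6h|^7h|^8h|^9h"
long_flights = "^10h|^11h|^12h|^13h|^14h|^15h|^16h"

# Quick check of the patterns against a few sample durations.
for duration in ["4h 45m", "5h 25m", "15h 30m"]:
    is_short = bool(re.search(short_flights, duration))
    is_medium = bool(re.search(medium_flights, duration))
    is_long = bool(re.search(long_flights, duration))
    print(duration, is_short, is_medium, is_long)
```

Under these assumed cut-offs, a value such as "19h" matches none of the three patterns, so it needs a fallback label when the column is built.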

![Flight%20duration%20categories.png](attachment:Flight%20duration%20categories.png)

Nicely done! Now you've created your categories and values, it's time to conditionally add the categories into the DataFrame.

#### Adding duration categories
Now that you've set up the categories and values you want to capture, it's time to build a new column to analyze the frequency of flights by duration!

The variables flight_categories, short_flights, medium_flights, and long_flights that you previously created are available to you.

Additionally, the following packages have been imported: pandas as pd, numpy as np, seaborn as sns, and matplotlib.pyplot as plt.
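
A sketch of building the column with np.select follows. It redefines the four variables with assumed values so that it runs on its own, and the fallback label "Extreme duration" is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for planes, using the sample values shown earlier.
planes = pd.DataFrame({"Duration": ["19h", "5h 25m", "4h 45m", "2h 25m", "15h 30m"]})

# Assumed labels and patterns (redefined here so the example is self-contained).
flight_categories = ["Short-haul", "Medium", "Long-haul"]
short_flights = "^0h|^1h|^2h|^3h|^4h"
medium_flights = "^5h|^6h|^7h|^8h|^9h"
long_flights = "^10h|^11h|^12h|^13h|^14h|^15h|^16h"

# One boolean condition per category, in the same order as flight_categories.
conditions = [
    planes["Duration"].str.contains(short_flights),
    planes["Duration"].str.contains(medium_flights),
    planes["Duration"].str.contains(long_flights),
]

# Pick the label of the first condition that is True; otherwise use the default.
planes["Duration_Category"] = np.select(
    conditions, flight_categories, default="Extreme duration"
)

print(planes["Duration_Category"].value_counts())

# With seaborn and matplotlib imported, the frequencies could be plotted with:
# sns.countplot(data=planes, x="Duration_Category")
# plt.show()
```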

![adding%20duration%20categories.png](attachment:adding%20duration%20categories.png)

Creative categorical transformation work! It's clear that the majority of flights are short-haul, and virtually none are longer than 16 hours! Now let's take a deep dive into working with numerical data.

## Working with numeric data

#### Flight duration
You would like to analyze the duration of flights, but unfortunately, the "Duration" column in the planes DataFrame currently contains string values.

You'll need to clean the column and convert it to the correct data type for analysis.
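
One possible way to do the conversion is sketched below. It assumes the strings look like "19h" or "5h 25m", as in the sample shown earlier, and turns them into hours as a float. The cleaning steps in the screenshot may differ, and the names hours and minutes are assumptions.

```python
import pandas as pd

# Toy stand-in for planes, using the sample values shown earlier.
planes = pd.DataFrame({"Duration": ["19h", "5h 25m", "4h 45m", "2h 25m", "15h 30m"]})

# Pull out the number before "h" and the number before "m"; a missing part counts as 0.
hours = planes["Duration"].str.extract(r"(\d+)h", expand=False).astype(float).fillna(0)
minutes = planes["Duration"].str.extract(r"(\d+)m", expand=False).astype(float).fillna(0)

# Express the duration in hours as a float.
planes["Duration"] = hours + minutes / 60

print(planes["Duration"])
print(planes["Duration"].dtype)

# With seaborn and matplotlib imported, the distribution could be plotted with:
# sns.histplot(data=planes, x="Duration")
# plt.show()
```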

![Flight%20duration.png](attachment:Flight%20duration.png)

Creative cleaning skills! Once the data was in the right format, you were able to plot the distribution of "Duration" and see that the most common flight length is around three hours.

#### Adding descriptive statistics
Now that "Duration" and "Price" both contain numeric values in the planes DataFrame, you would like to calculate summary statistics for them that are conditional on values in other columns.
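
Conditional statistics like these can be added as new columns with groupby and transform, which returns one value per row rather than one per group. The sketch below uses invented data, and the new column names are assumptions.

```python
import pandas as pd

# Toy stand-in for planes (values are invented for illustration).
planes = pd.DataFrame({
    "Airline": ["IndiGo", "IndiGo", "Jet Airways", "Jet Airways"],
    "Destination": ["Delhi", "Cochin", "Delhi", "Cochin"],
    "Duration": [2.0, 3.0, 5.0, 9.0],
    "Price": [4000.0, 5000.0, 10000.0, 14000.0],
})

# Standard deviation of price within each airline.
planes["airline_price_st_dev"] = planes.groupby("Airline")["Price"].transform("std")

# Median duration within each airline.
planes["airline_median_duration"] = planes.groupby("Airline")["Duration"].transform("median")

# Mean price for each destination.
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform("mean")

print(planes)
```

Calling value_counts() on a pair such as planes[["Airline", "airline_price_st_dev"]] then gives one line per group for comparison.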

![Adding%20descriptive%20statistics.png](attachment:Adding%20descriptive%20statistics.png)

![Adding%20descriptive%20statistics%202.png](attachment:Adding%20descriptive%20statistics%202.png)

![Adding%20descriptive%20statistics%203.png](attachment:Adding%20descriptive%20statistics%203.png)

Terrific transforming! Looks like Jet Airways has the largest standard deviation in price, Air India has the largest median duration, and New Delhi, on average, is the most expensive destination. Now let's look at how to handle outliers.

## Handling outliers

#### What to do with outliers

![what%20to%20do%20with%20outliers.png](attachment:what%20to%20do%20with%20outliers.png)

Great work! It can be difficult to decide what to do with outliers, but you must know how to handle them, as they often occur in the real world!

#### Identifying outliers

![identifying%20outlier%201.png](attachment:identifying%20outlier%201.png)

![identifying%20outlier%202.png](attachment:identifying%20outlier%202.png)

![identifying%20outlier%203.png](attachment:identifying%20outlier%203.png)

Impressive outlier detecting! Histograms, boxplots, and descriptive statistics are also useful methods for identifying extreme values. Now let's deal with them!

#### Removing outliers
While removing outliers isn't always the way to go, for your analysis, you've decided that you will only include flights where the "Price" is not an outlier.

Therefore, you need to find the upper threshold and then use it to remove values above this from the planes DataFrame.
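
A minimal sketch of this step on invented prices is shown below. It assumes the common rule of placing the upper threshold at the 75th percentile plus 1.5 times the IQR, and the variable names are assumptions.

```python
import pandas as pd

# Toy stand-in for planes (prices are invented for illustration).
planes = pd.DataFrame(
    {"Price": [4000.0, 5000.0, 6000.0, 7000.0, 8000.0, 9000.0, 10000.0, 55000.0]}
)

# 25th and 75th percentiles, and the interquartile range between them.
price_25th = planes["Price"].quantile(0.25)
price_75th = planes["Price"].quantile(0.75)
price_iqr = price_75th - price_25th

# Upper threshold: 75th percentile plus 1.5 times the IQR.
upper = price_75th + 1.5 * price_iqr

# Keep only the rows whose price is not above the upper threshold.
planes = planes[planes["Price"] <= upper]

print(planes["Price"].describe())
```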

pandas has been imported for you as pd, along with seaborn as sns.

![removing%20outliers.png](attachment:removing%20outliers.png)

Ridiculous outlier removal skills! You managed to create thresholds based on the IQR and used them to filter the planes dataset to eliminate extreme prices. Originally the dataset had a maximum price of almost 55000, but the output of planes.describe() shows the maximum has been reduced to around 23000, reflecting a less skewed distribution for analysis!