# 3. Advanced data problems
**In this chapter, you’ll dive into more advanced data cleaning problems, such as ensuring that weights are all written in kilograms instead of pounds. You’ll also gain invaluable skills that will help you verify that values have been added correctly and that missing values don’t negatively impact your analyses.**

## Uniformity
In this lesson we tackle another problem that can skew our data: unit uniformity. For example, temperature data may mix Fahrenheit and Celsius values, weight data may mix kilograms and pounds, dates may appear in multiple formats, and so on. Verifying unit uniformity is essential for accurate analysis.

Column | Unit
:---|:---
Temperature | `32°C` **is also** `89.6°F`
Weight | `70 kg` **is also** `154 lb`
Date | `26-11-2020` **is also** `26, November, 2020`
Money | `100$` **is also** `85€`



### An example
Here's a dataset with average temperature data throughout the month of March in New York City. The dataset was collected from different sources with temperature data in Celsius and Fahrenheit merged together. 

```python
temperatures = pd.read_csv('temperatures.csv')
temperatures.head()
```

```
       Date  Temperature
0  03.03.19         14.0
1  04.03.19         15.0
2  03.03.19         18.0
3  04.03.19         16.0
4  03.03.19         62.6  <--
```
We can see that unless a major climate event occurred, the last value here is most likely Fahrenheit, not Celsius. 
 
To confirm the presence of these values visually, we can plot the data as a scatter plot using `matplotlib.pyplot`, imported as `plt`. The `plt.scatter()` function takes the column to plot on the x-axis, the column to plot on the y-axis, and the data source to use.

```python
import matplotlib.pyplot as plt
plt.scatter(x='Date', y='Temperature', data=temperatures)
plt.show()
```


Here is the formula for converting Fahrenheit to Celsius.

$$
C = (F - 32) \times \frac{5}{9}
$$

To convert our temperature data, we isolate all rows of the Temperature column where the value is above 40 using the `.loc` method. We chose 40 because it's a common-sense maximum for Celsius temperatures in New York City.

```python
# Select the readings that are assumed to be in Fahrenheit
temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
# Apply the Fahrenheit-to-Celsius formula
temp_cels = (temp_fah - 32) * (5/9)
# Write the converted values back to the same rows
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels
```

We then convert these values to Celsius using the formula above and assign them back to the same rows of `temperatures`, overwriting the Fahrenheit values.

We can make sure that our conversion was correct with an `assert` statement, by making sure the maximum value of temperature is less than 40.

```python
# Assert conversion is correct
assert temperatures['Temperature'].max() < 40
```
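As a self-contained illustration, the sketch below runs the same steps on a small made-up DataFrame so the result can be checked by hand. The sample values are invented for illustration, and the cutoff of 40 is the same assumption discussed above rather than a general rule.

```python
import pandas as pd

# Toy data (hypothetical): four Celsius readings and one Fahrenheit reading
temperatures = pd.DataFrame({
    'Date': ['03.03.19', '04.03.19', '05.03.19', '06.03.19', '07.03.19'],
    'Temperature': [14.0, 15.0, 18.0, 16.0, 62.6],
})

# Rows above the cutoff are assumed to be in Fahrenheit
is_fahrenheit = temperatures['Temperature'] > 40

# Convert only those rows to Celsius and write them back in place
temperatures.loc[is_fahrenheit, 'Temperature'] = (
    (temperatures.loc[is_fahrenheit, 'Temperature'] - 32) * 5 / 9
)

# All values should now sit below the cutoff
assert temperatures['Temperature'].max() < 40
print(temperatures)
```

Storing the boolean mask in a variable avoids repeating the condition and guarantees that the rows read and the rows written are the same.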

### Treating date data
Here's another common uniformity problem with date data. This is a DataFrame called birthdays containing birth dates for a variety of individuals. It has been collected from a variety of sources and merged into one.

```python
birthdays.head()
```

```
          Birthday First name Last name
0         27/27/19      Rowan     Nunez    <--- ??
1         03-29-19      Brynn       Lee    <--- MM-DD-YY
2  March 3rd, 2019     Sophia    Reilly    <--- Month Day, YYYY
3         24-03-19     Deacon    Prince
4         06-03-19   Griffith      Neal
```
The second one has the month-day-year format, whereas the third one has the month written out. The first one is obviously an error, with what looks like a day-day-year format.

### Datetime formatting
`datetime` accepts different format codes that let you parse and display dates in the format you want.

Date | `datetime` format
:---|:---
25-12-2019 | `%d-%m-%Y`
December 25th 2019 | `%B %dth %Y`
12-25-2019 | `%m-%d-%Y`
... | ...

The pandas `to_datetime()` function (`pandas.to_datetime()`) automatically accepts most date formats, but can raise errors when certain formats are unrecognizable.
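To make the format codes concrete, here is a minimal sketch using only the standard library's `datetime.strptime`, which parses a string against an explicit format. The two date strings come from the table above; the variable names are illustrative.

```python
from datetime import datetime

# Each pair is (date string, matching format)
examples = [
    ('25-12-2019', '%d-%m-%Y'),
    ('12-25-2019', '%m-%d-%Y'),
]

# Both strings describe the same day once parsed with the right format
parsed = [datetime.strptime(text, fmt) for text, fmt in examples]
for (text, fmt), value in zip(examples, parsed):
    print(text, fmt, '->', value.date())

# A mismatched format raises ValueError rather than guessing
try:
    datetime.strptime('25-12-2019', '%m-%d-%Y')
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print('Mismatch raised an error:', mismatch_raised)
```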

### Treating date data

You can treat these date inconsistencies easily by converting your date column to datetime. We can do this in pandas with the `to_datetime` function.

```python
# Converts to datetime
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'])
```

```
ValueError: month must be in 1..12
```

However, this alone isn't enough and will most likely raise an error, since we have dates in multiple formats, in particular the odd day/day/year entry, which triggers the month error shown above.

Instead, we set the `infer_datetime_format` argument to `True` and set `errors='coerce'`. This infers the format and returns a missing value, instead of raising a `ValueError`, for dates that cannot be identified and converted.
```python
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'], 
                                       # Attempt to infer format of each date
                                       infer_datetime_format=True,
                                       # Return NA for rows where conversion failed
                                       errors = 'coerce')
```

This returns the birthday column with aligned formats, with the initial ambiguous day-day-year entry set to `NaT`, which represents missing values in pandas for datetime objects.

```python
birthdays.head()
```

```
          Birthday First name Last name
0              NaT      Rowan     Nunez 
1       2019-03-29      Brynn       Lee 
2       2019-03-03     Sophia    Reilly 
3       2019-03-24     Deacon    Prince
4       2019-06-03   Griffith      Neal
```
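Note that `infer_datetime_format` is deprecated from pandas 2.0 onward, so the call above may emit a warning on newer versions. One alternative that does not depend on inference is to try a short list of explicit formats in order and keep the first successful parse. The sketch below assumes a hypothetical sample and a hypothetical list of candidate formats; it is not the method used in this chapter.

```python
import pandas as pd

# Toy strings mirroring the mixed formats above (hypothetical sample)
raw = pd.Series(['27/27/19', '03-29-19', '24-03-19'])

# Candidate formats, tried in order; this list is an assumption about the sources
candidate_formats = ['%m-%d-%y', '%d-%m-%y']

# Parse the whole column once per format; failures become NaT
attempts = [
    pd.to_datetime(raw, format=fmt, errors='coerce')
    for fmt in candidate_formats
]

# Keep earlier successes and fill the remaining gaps with later attempts
parsed = attempts[0]
for attempt in attempts[1:]:
    parsed = parsed.fillna(attempt)

print(parsed)
```

The order of the candidate list matters: a string such as `06-03-19` matches both formats and is read with whichever comes first.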

### Datetime formatting
We can also convert the format of a datetime column using the `dt.strftime` method, which accepts a datetime format of your choice. For example, here we convert the Birthday column to day month year, instead of year month day.

```python
birthdays['Birthday'] = birthdays['Birthday'].dt.strftime('%d-%m-%Y')
birthdays.head()
```
```
          Birthday First name Last name
0              NaT      Rowan     Nunez 
1       29-03-2019      Brynn       Lee 
2       03-03-2019     Sophia    Reilly 
3       24-03-2019     Deacon    Prince
4       03-06-2019   Griffith      Neal
```

### Treating ambiguous date data
However, a common problem is having ambiguous dates with vague formats.

For example, is `03-08-2019` in March or August? Unfortunately, there's no clear-cut way to spot this inconsistency or to treat it.

Depending on the size of the dataset and the suspected ambiguities, we can:
- Convert these dates to `NA`s and deal with them accordingly.
- Infer the format from additional context on the source of the data, if we have it.
- Infer the format from the majority of subsequent or previous data, if most of it is of one format.

All in all, it is essential to properly **understand where your data comes from** before trying to treat it, as it will make these decisions much easier.
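Although ambiguity cannot be resolved automatically, it can at least be flagged. The sketch below is one possible approach under stated assumptions: it treats a date string as ambiguous when it parses successfully both as day-month-year and as month-day-year and the two readings differ. The sample strings and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample of date strings
dates = pd.Series(['03-08-2019', '25-12-2019', '12-25-2019', '05-05-2019'])

# Parse every string under both readings; failures become NaT
day_first = pd.to_datetime(dates, format='%d-%m-%Y', errors='coerce')
month_first = pd.to_datetime(dates, format='%m-%d-%Y', errors='coerce')

# Ambiguous: both readings are valid dates and they disagree
ambiguous = day_first.notna() & month_first.notna() & (day_first != month_first)

print(dates[ambiguous])
```

Rows flagged this way can then be set to missing or resolved with knowledge of the data source, as described above.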

## Uniform currencies
In this exercise and throughout this chapter, you will be working with a retail banking dataset stored in the banking DataFrame.

```python
import pandas as pd
banking = pd.read_csv('banking.csv')
banking.head()
```

```
Unnamed: 0.1,Unnamed: 0,cust_id,birth_date,Age,acct_amount,inv_amount,fund_A,fund_B,fund_C,fund_D,account_opened,last_transaction
0,0,870A9281,1962-06-09,58,63523.31,51295,30105.0,4138.0,1420.0,15632.0,02-09-18,22-02-19
1,1,166B05B0,1962-12-16,58,38175.46,15050,4995.0,938.0,6696.0,2421.0,28-02-19,31-10-18
2,2,BFC13E88,1990-09-12,34,59863.77,24567,10323.0,4590.0,8469.0,1185.0,25-04-18,02-04-18
3,3,F2158F66,1985-11-03,35,84132.1,23712,3908.0,492.0,6482.0,12830.0,07-11-17,08-11-18
4,4,7A73F334,1990-05-17,30,120512.0,93230,12158.4,51281.0,13434.0,18383.0,14-05-18,19-07-18
```


The dataset contains data on the amount of money stored in accounts (`acct_amount`), their currency (`acct_cur`), amount invested (`inv_amount`), account opening date (`account_opened`), and last transaction date (`last_transaction`) that were consolidated from American and European branches.

You are tasked with understanding the average account size and how investments vary by the size of account. However, to produce this analysis accurately, you first need to convert all account amounts into dollars.

- Find the rows of `acct_cur` in `banking` that are equal to `'euro'` and store them in the variable `acct_eu`.
- Find all the rows of `acct_amount` in `banking` that fit the `acct_eu` condition, and convert them to USD by multiplying them with `1.1`.
- Find all the rows of `acct_cur` in `banking` that fit the `acct_eu` condition, set them to `'dollar'`.

```python
# Find values of acct_cur that are equal to 'euro'
acct_eu = banking['acct_cur'] == 'euro'

# Convert acct_amount where it is in euro to dollars
banking.loc[acct_eu, 'acct_amount'] = banking.loc[acct_eu, 'acct_amount'] * 1.1 

# Unify acct_cur column by changing 'euro' values to 'dollar'
banking.loc[acct_eu, 'acct_cur'] = 'dollar'

# Assert that only dollar currency remains
assert (banking['acct_cur'] == 'dollar').all()
```
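When more than two currencies are involved, the same idea can be expressed with a lookup table of rates. The following is a minimal sketch on made-up data: the `rates_to_dollar` dictionary and the sample rows are hypothetical, and only the euro rate of `1.1` is taken from the exercise above.

```python
import pandas as pd

# Toy data (hypothetical)
banking = pd.DataFrame({
    'acct_amount': [100.0, 200.0, 300.0],
    'acct_cur': ['dollar', 'euro', 'euro'],
})

# Conversion rate to dollars for each currency label
rates_to_dollar = {'dollar': 1.0, 'euro': 1.1}

# Unmapped labels become NaN, which surfaces unexpected currencies
rate = banking['acct_cur'].map(rates_to_dollar)
assert rate.notna().all(), 'unknown currency label found'

# Convert every amount, then unify the currency label
banking['acct_amount'] = banking['acct_amount'] * rate
banking['acct_cur'] = 'dollar'

print(banking)
```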

## Uniform dates
After having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The `account_opened` column represents when customers opened their accounts and is a good proxy for segmenting customer activity and investment over time.

However, since this data was consolidated from multiple sources, you need to make sure that all dates are of the same format. You will do so by converting this column into a `datetime` object, while making sure that the format is inferred and potentially incorrect formats are set to missing. 

- Print the header of `account_opened` from the `banking` DataFrame and take a look at the different results.

```python
# Print the header of account_opened
banking['account_opened'].head()
```

```
0    02-09-18
1    28-02-19
2    25-04-18
3    07-11-17
4    14-05-18
Name: account_opened, dtype: object
```

### Question
Take a look at the output. You tried converting the values to `datetime` using the default `to_datetime()` function without changing any arguments, but received the following error:

`ValueError: month must be in 1..12`

Why do you think that is?

**Answers**

1. ~The `to_datetime()` function needs to be explicitly told which date format each row is in.~

2. ~The `to_datetime()` function can only be applied on `YY-mm-dd` date formats.~

3. **The `21-14-17` entry is erroneous and leads to an error.**

- Convert the `account_opened` column to `datetime`, while making sure the date format is inferred and that erroneous entries that would raise an error return a missing value instead.

```python
# Print the header of account_opened
print(banking['account_opened'].head())

# Convert account_opened to datetime
banking['account_opened'] = pd.to_datetime(banking['account_opened'],
                                           # Infer datetime format
                                           infer_datetime_format=True,
                                           # Return missing value for error
                                           errors = 'coerce')
```

```
0    02-09-18
1    28-02-19
2    25-04-18
3    07-11-17
4    14-05-18
Name: account_opened, dtype: object
```


- Extract the year from the amended `account_opened` column and assign it to the `acct_year` column.
- Print the newly created `acct_year` column.

```python
# Get year of account opened
banking['acct_year'] = banking['account_opened'].dt.strftime('%Y')

# Print acct_year
banking['acct_year'].head()
```

```
0    2018
1    2019
2    2018
3    2017
4    2018
Name: acct_year, dtype: object
```
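The `object` dtype in the output above indicates that `.dt.strftime('%Y')` returns the year as text. If a numeric year is more convenient, for example for sorting or arithmetic, the `.dt.year` attribute is an alternative. The short sketch below contrasts the two on made-up values; the explicit day-first format is an assumption about the sample strings.

```python
import pandas as pd

# Two sample dates, parsed with an explicit day-month-year format
opened = pd.to_datetime(pd.Series(['02-09-18', '28-02-19']), format='%d-%m-%y')

year_as_text = opened.dt.strftime('%Y')   # strings such as '2018'
year_as_number = opened.dt.year           # integers such as 2018

print(year_as_text.tolist())
print(year_as_number.tolist())
```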

---

## Cross field validation

### Motivation
Consider the `flights` DataFrame below. It contains flight statistics on the total number of passengers in economy, business and first class as well as the total passengers for each flight. We know that these columns have been collected and merged from different data sources, and a common challenge when merging data from different sources is data integrity, or more broadly making sure that our data is correct.

```python
import pandas as pd

flights = pd.read_csv('flights.csv')
flights.head()
```
```
    flight_number  economy_class  business_class  first_class  total_passengers
0           DL140            100              60           40               200
1           BA248            130             100           70               300
2          MEA124            100              50           50               200
3          AFR939            140              70           90               300
4          TKA101            130             100           20               250
```

### Cross field validation
*The use of **multiple** fields in a dataset to sanity check data integrity*

For example in the flights dataset, this could be summing economy, business and first class values and making sure they are equal to the total passengers on the plane. 

```
    flight_number  economy_class  business_class  first_class  total_passengers
0           DL140            100       +      60      +    40        =      200
1           BA248            130       +     100      +    70        =      300
2          MEA124            100       +      50      +    50        =      200
3          AFR939            140       +      70      +    90        =      300
4          TKA101            130       +     100      +    20        =      250
```
This can be done easily in pandas by first subsetting on the columns to sum, then using the `.sum()` method with the `axis` argument set to 1 to indicate row-wise summing.

We then find instances where the total passengers column is equal to the sum of the classes, and filter out instances of inconsistent passenger amounts by subsetting on the equality we created with brackets and the tilde symbol.

```python
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis = 1)
passenger_equ = sum_classes == flights['total_passengers']
# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
```
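Since every row in the sample shown earlier is consistent, the sketch below adds one deliberately inconsistent row to show the filter at work. The flight `XX000` and its numbers are invented for illustration.

```python
import pandas as pd

# Two rows from the sample above plus one made-up inconsistent row
flights = pd.DataFrame({
    'flight_number': ['DL140', 'BA248', 'XX000'],
    'economy_class': [100, 130, 100],
    'business_class': [60, 100, 50],
    'first_class': [40, 70, 50],
    'total_passengers': [200, 300, 250],
})

# Row-wise sum of the class columns
sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis=1)

# True where the reported total matches the sum of the classes
passenger_equ = sum_classes == flights['total_passengers']

inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

print(inconsistent_pass)
```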

### Cross field validation
Here's another example containing user IDs, birthdays and age values for a set of users. 

```python
users.head()
```
```
    user_id  Age    Birthday
0     32985   22  1998-03-02
1     94387   27  1993-12-04
2     34236   42  1978-11-24
3     12551   31  1989-01-03
4     55212   18  2002-07-02
```
For example, we can make sure that the age and birthday columns agree by computing the number of years between today's date and each birthday.

We first make sure the Birthday column is converted to datetime with the pandas `to_datetime()` function. We then create an object storing today's date using the `datetime` package's `date.today()` function. Next, we calculate the difference between the current year and the year of each birthday, using the `.dt.year` attribute of the Birthday column. We then find instances where the calculated ages are equal to the actual Age column in the users DataFrame. Finally, we filter out the inconsistent rows by subsetting with brackets and the tilde symbol on the equality we created.

```python
import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])
today = dt.date.today()
# For each row in the Birthday column, calculate year difference
age_manual = today.year - users['Birthday'].dt.year
# Find instances where ages match
age_equ = age_manual == users['Age']
# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]
```
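A plain difference of years counts a birthday that has not yet occurred in the current year, so it can overstate an age by one. If that matters for the check, a stricter calculation is possible. The sketch below is one way to do it; the helper `age_on` is a hypothetical name, and a fixed reference date is used instead of today's date so the result is reproducible.

```python
import pandas as pd

def age_on(birthdays, reference):
    """Completed years between each birthday and the reference date."""
    years = reference.year - birthdays.dt.year
    # True where this year's birthday falls after the reference date
    not_yet = (birthdays.dt.month > reference.month) | (
        (birthdays.dt.month == reference.month)
        & (birthdays.dt.day > reference.day)
    )
    # Subtract one year for birthdays that have not happened yet
    return years - not_yet.astype(int)

birthdays = pd.to_datetime(pd.Series(['1998-03-02', '1993-12-04']))
reference = pd.Timestamp('2020-06-15')  # fixed date, chosen for illustration
ages = age_on(birthdays, reference)

print(ages.tolist())
```

Whether to use the simple or the strict calculation depends on how the age column was recorded in the source data.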

### What to do when we catch inconsistencies?
So what should be the course of action when we spot inconsistencies with cross-field validation? Just like other data cleaning problems, there is no one-size-fits-all solution, as the best solution often requires an in-depth understanding of our dataset. We can decide to drop inconsistent data, set it to missing and impute it, or apply rules based on domain knowledge. All these routes and assumptions can be decided upon only when you have a good understanding of where your dataset comes from and the different sources feeding into it.


## Cross field or no cross field?
Throughout this course, you've been immersed in a variety of data cleaning problems from range constraints, data type constraints, uniformity and more.

In this lesson, you were introduced to cross field validation as a means to sanity check your data and make sure you have strong data integrity.

Now, you will map different applicable concepts and techniques to their respective categories.

Cross field validation | Not cross field validation
:---|:---
Row wise operations such as `.sum(axis = 1)` | Making sure a `subscription_date` column has no values set in the future
Confirming the Age provided by users by cross checking their birthdays | The use of the `.astype()` method
 | Making sure that a `revenue` column is a numeric column

## How's our data integrity?
New data has been merged into the `banking` DataFrame that contains details on how investments in the `inv_amount` column are allocated across four different funds A, B, C and D.

Furthermore, the age and birthdays of customers are now stored in the `age` and `birth_date` columns respectively.

You want to understand how customers of different age groups invest. However, you want to first make sure the data you're analyzing is correct. You will do so by cross field checking values of `inv_amount` and `age` against the amount invested in different funds and customers' birthdays.

```python
import pandas as pd
import datetime as dt
```

- Find the rows where the sum of all rows of the `fund_columns` in `banking` are equal to the `inv_amount` column.
- Store the values of `banking` with consistent `inv_amount` in `consistent_inv`, and those with inconsistent ones in `inconsistent_inv`.

```python
# Store fund columns to sum against
fund_columns = ['fund_A', 'fund_B', 'fund_C', 'fund_D']

# Find rows where fund_columns row sum == inv_amount
inv_equ = banking[fund_columns].sum(axis = 1) == banking['inv_amount']

# Store consistent and inconsistent data
consistent_inv = banking[inv_equ]
inconsistent_inv = banking[~inv_equ]

# Print the number of inconsistent rows
print("Number of inconsistent investments: ", inconsistent_inv.shape[0])
```

```
Number of inconsistent investments:  8
```


- Store today's date into `today`, and manually calculate customers' ages and store them in `ages_manual`.
- Find all rows of `banking` where the `age` column is equal to `ages_manual` and then filter `banking` into `consistent_ages` and `inconsistent_ages`.

```python
# Store today's date and find ages
today = dt.date.today()
ages_manual = today.year - banking['birth_date'].dt.year

# Find rows where age column == ages_manual
age_equ = ages_manual == banking['age']

# Store consistent and inconsistent data
consistent_ages = banking[age_equ]
inconsistent_ages = banking[~age_equ]

# Print the number of inconsistent rows
print("Number of inconsistent ages: ", inconsistent_ages.shape[0])
```

```
Number of inconsistent ages:  4
```

*There are only 8 and 4 rows affected by inconsistent `inv_amount` and `age` values respectively. In this case, it's best to investigate the underlying data sources before deciding on a course of action.*

---

## Completeness

### What is missing data?
Missing data is one of the most common and most important data cleaning problems. Essentially, missing data is ***when no data value is stored for a variable in an observation***.

Missing data is most commonly represented as `NA` or `NaN`, but can take on arbitrary values like `0` or `.`.

It's commonly due to **technical** or **human errors**. Missing data can take many forms, so let's take a look at an example.

### Airquality example
The `airquality` DataFrame contains temperature and CO2 measurements for different dates.
```python
import pandas as pd
airquality = pd.read_csv('airquality.csv')
print(airquality)
```
```
            Date   Temperature   CO2
987   20/04/2004          16.8   0.0
2119  07/06/2004          18.7   0.8
2451  20/06/2004         -40.0   NaN    <---
1984  01/06/2004          19.6   1.8
8299  19/02/2005          11.2   1.2
...      ...              ...    ...     
```
We can see that the CO2 value in this row is represented as NaN.

We can find rows with missing values by using the `.isna` method, which returns `True` for missing values and `False` for complete values across all our rows and columns.
```python
# Return missing values
airquality.isna()
```
```
       Date   Temperature    CO2
987   False         False  False
2119  False         False  False
2451  False         False   True
1984  False         False  False
8299  False         False  False
```
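Keep in mind that `.isna` only flags values that pandas already treats as missing. Placeholders such as `.` are read as ordinary text unless they are declared. The sketch below shows two ways to handle this on a small in-memory CSV; the sample rows are made up, and `.` as the placeholder is an assumption for illustration.

```python
import io
import pandas as pd

# Small in-memory CSV where a missing CO2 value is recorded as "."
csv_text = (
    "Date,Temperature,CO2\n"
    "20/04/2004,16.8,0.8\n"
    "20/06/2004,-40.0,.\n"
)

# Option 1: declare the placeholder while reading
airquality = pd.read_csv(io.StringIO(csv_text), na_values=['.'])

# Option 2: read as-is, then coerce non-numeric entries to NaN
raw = pd.read_csv(io.StringIO(csv_text))
raw['CO2'] = pd.to_numeric(raw['CO2'], errors='coerce')

print(airquality.isna().sum())
```

The second option turns every non-numeric entry into `NaN`, not only the known placeholder, so it is worth inspecting the column first.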

We can also chain the `.isna` method with the `.sum` method, which returns a breakdown of missing values per column in our dataframe.

```python
# Get summary of missingness
airquality.isna().sum()
```
```
Date             0
Temperature      0
CO2            366
dtype: int64
```
We notice that the CO2 column is the only column with missing values - let's find out why and dig further into the nature of this missingness by first visualizing our missing values.

### Missingno
***Useful package for visualizing and understanding missing data***

The missingno package allows us to create useful visualizations of our missing data. We visualize the missingness of the airquality DataFrame with the `msno.matrix` function, and show it with matplotlib's `plt.show()` function.

```python
import missingno as msno
import matplotlib.pyplot as plt
# Visualize missingness
msno.matrix(airquality)
plt.show()
```
The matrix essentially shows how missing values are distributed across a column. We see that missing CO2 values are randomly scattered throughout the column, but is that really the case? Let's dig deeper.

### Airquality example
We first isolate the rows of airquality with missing CO2 values in one DataFrame, and complete CO2 values in another.
```python
# Isolate rows with missing and with complete CO2 values
missing = airquality[airquality['CO2'].isna()]
complete = airquality[~airquality['CO2'].isna()]
```

Then, let's use the describe method on each of the created DataFrames.

```python
# Describe complete DataFrame
complete.describe()
```
```
       Temperature           CO2
count  8991.000000   8991.000000
mean     18.317829      1.739584
std       8.832116      1.537580
min      -1.900000      0.000000
...       ...           ...
max      44.600000     11.900000
```

```python
# Describe missing DataFrame
missing.describe()
```
```
       Temperature   CO2
count   366.000000   0.0
mean    -39.655738   NaN   <---
std       5.988716   NaN
min     -49.000000   NaN   <---
...       ...        ...
max     -30.000000   NaN   <---
```

We see that all missing values of CO2 occur at very low temperatures, with a mean temperature of about -39.7 degrees and a minimum and maximum of -49 and -30 respectively.
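The same comparison can be produced in a single table by grouping on the missingness indicator instead of splitting the DataFrame in two. This is a minimal sketch on made-up values, not the dataset above; the name `co2_missing` is illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): CO2 is missing only at very low temperatures
airquality = pd.DataFrame({
    'Temperature': [16.8, 18.7, -40.0, 19.6, -35.0],
    'CO2': [0.0, 0.8, np.nan, 1.8, np.nan],
})

# Boolean indicator: True where CO2 is missing
co2_missing = airquality['CO2'].isna().rename('co2_missing')

# Compare temperature statistics between the two groups
summary = airquality.groupby(co2_missing)['Temperature'].agg(
    ['count', 'mean', 'min', 'max']
)
print(summary)
```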

Let's confirm this visually with the missingno package.

We first sort the DataFrame by the temperature column. Then we pass the sorted DataFrame to the `matrix` function from msno. This leaves us with this matrix.

```python
sorted_airquality = airquality.sort_values(by='Temperature')
msno.matrix(sorted_airquality)
plt.show()
```

### Missingness types

*Missing Completely at Random*<br/>**(MCAR)** | *Missing at Random*<br/>**(MAR)** | *Missing Not at Random*<br/>**(MNAR)**
:---|:---|:---
No systematic relationship between missing data and other values<br/><br/>Data entry errors when inputting data | Systematic relationship between missing data and other ***observed*** values<br/><br/>Missing ozone data for high temperatures | Systematic relationship between missing data and ***unobserved*** values<br/><br/>Missing temperature values for high temperatures

### How to deal with missing data?
**Simple approaches:**

1. Drop missing data
2. Impute with statistical measures *(mean, median, mode...)*

**More complex approaches:**

1. Impute using an algorithmic approach
2. Impute with machine learning models

### Dealing with missing data
We'll explore only the simple approaches to dealing with missing data. Let's take another look at the head of airquality.

```python
airquality.head()
```
```
         Date   Temperature   CO2
0  05/03/2005           8.5   2.5
1  23/08/2004          21.8   0.0
2  18/02/2005           6.3   1.8 
3  08/02/2005         -31.0   NaN
4  13/03/2005          19.9   0.1
```

### Dropping missing values
We can drop missing values, by using the `.dropna` method, alongside the subset argument which lets us pick which column's missing values to drop.

```python
# Drop missing values
airquality_dropped = airquality.dropna(subset = ['CO2'])
airquality_dropped.head()
```
```
         Date   Temperature   CO2
0  05/03/2005           8.5   2.5
1  23/08/2004          21.8   0.0
2  18/02/2005           6.3   1.8 
4  13/03/2005          19.9   0.1
5  02/04/2005          17.0   0.8
```

### Replacing with statistical measures
We can also replace the missing values of CO2 with the mean value of CO2, which is about 1.74 in this case, by using the `.fillna` method.

```python
co2_mean = airquality['CO2'].mean()
airquality_imputed = airquality.fillna({'CO2': co2_mean})
airquality_imputed.head()
```
```
         Date   Temperature        CO2
0  05/03/2005           8.5   2.500000
1  23/08/2004          21.8   0.000000
2  18/02/2005           6.3   1.800000
3  08/02/2005         -31.0   1.739584
4  13/03/2005          19.9   0.100000
```
`.fillna` takes a dictionary with columns as keys and the imputed values as values. We can even feed custom values into `.fillna` if we have enough domain knowledge about our dataset.
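Because the dictionary is keyed by column, several columns can be imputed in one call, each with its own statistic. The sketch below uses made-up values in which both columns have a gap; the choice of median for one column and mean for the other is only an illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one gap in each column
airquality = pd.DataFrame({
    'Temperature': [8.5, 21.8, np.nan, -31.0],
    'CO2': [2.5, 0.0, 1.8, np.nan],
})

# One fill value per column: median for Temperature, mean for CO2
fill_values = {
    'Temperature': airquality['Temperature'].median(),
    'CO2': airquality['CO2'].mean(),
}

airquality_imputed = airquality.fillna(fill_values)
print(airquality_imputed)
```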

## Is this missing at random?
Missingness types can be described as the following:

- **Missing Completely at Random**: No systematic relationship between a column's missing values and other values or its own values.
- **Missing at Random**: There is a systematic relationship between a column's missing values and other ***observed*** values.
- **Missing not at Random**: There is a systematic relationship between a column's missing values and ***unobserved*** values.

You have a DataFrame containing customer satisfaction scores for a service. What type of missingness is the following?
   
- *A customer `satisfaction_score` column with missing values for highly dissatisfied customers.*

1. ~Missing completely at random.~

2. ~Missing at random.~

3. **Missing not at random.**

**Answer: 3.** This is a clear example of missing not at random, where low values of satisfaction_score are missing because of inherently low satisfaction.

## Missing investors
Dealing with missing data is one of the most common tasks in data science. There are a variety of types of missingness, as well as a variety of types of solutions to missing data.

You just received a new version of the `banking` DataFrame containing data on the amount held and invested for new and existing customers. However, there are rows with missing `inv_amount` values.

You know for a fact that most customers below 25 do not have investment accounts yet, and suspect it could be driving the missingness. 

- Print the number of missing values by column in the `banking` DataFrame.
- Plot and show the missingness matrix of `banking` with the `msno.matrix()` function.

```python
import missingno as msno
import matplotlib.pyplot as plt

# Print number of missing values in banking
print(banking.isna().sum())

# Visualize missingness matrix
msno.matrix(banking)
plt.show()
```

```
cust_id              0
age                  0
acct_amount          0
inv_amount          13
account_opened       0
last_transaction     0
dtype: int64
```

- Isolate the rows of `banking` with missing `inv_amount` values into `missing_investors`, and those with non-missing `inv_amount` values into `investors`.

```python
# Isolate missing and non missing values of inv_amount
missing_investors = banking[banking['inv_amount'].isna()]
investors = banking[~banking['inv_amount'].isna()]
```

### Question
Now that you've isolated `banking` into `investors` and `missing_investors`, use the `.describe()` method on both of these DataFrames in the console to understand whether there are structural differences between them. What do you think is going on?

```python
investors.describe()
```

```
             age  ...    inv_amount
count  84.000000  ...     84.000000
mean   43.559524  ...  44717.885476
std    10.411244  ...  26031.246094
min    26.000000  ...   3216.720000
25%    34.000000  ...  22736.037500
50%    45.000000  ...  44498.460000
75%    53.000000  ...  66176.802500
max    59.000000  ...  93552.690000

[8 rows x 3 columns]
```

```python
missing_investors.describe()
```

```
             age  ...  inv_amount
count  13.000000  ...         0.0
mean   21.846154  ...         NaN
std     1.519109  ...         NaN
min    20.000000  ...         NaN
25%    21.000000  ...         NaN
50%    21.000000  ...         NaN
75%    23.000000  ...         NaN
max    25.000000  ...         NaN

[8 rows x 3 columns]
```

1. ~The data is missing completely at random and there are no drivers behind the missingness.~

2. **The `inv_amount` is missing only for young customers, since the average age in `missing_investors` is 22 and the maximum age is 25.**

3. ~The `inv_amount` is missing only for old customers, since the average age in `missing_investors` is 42 and the maximum age is 59.~

**Answer: 2.**

- Sort the `banking` DataFrame by the `age` column and plot the missingness matrix of `banking_sorted`.

```python
# Sort banking by age and visualize
banking_sorted = banking.sort_values(by='age')
msno.matrix(banking_sorted)
plt.show()
```

## Follow the money
In this exercise, you're working with another version of the `banking` DataFrame that contains missing values for both the `cust_id` column and the `acct_amount` column.

You want to produce analysis on how many unique customers the bank has, the average amount held by customers and more. You know that rows with missing `cust_id` don't really help you, and that on average `acct_amount` is usually 5 times the amount of `inv_amount`.

In this exercise, you will drop rows of `banking` with missing `cust_ids`, and impute missing values of `acct_amount` with some domain knowledge.

- Use `.dropna()` to drop missing values of the `cust_id` column in `banking` and store the results in `banking_fullid`.
- Use `inv_amount` to compute the estimated account amounts for `banking_fullid` by setting the amounts equal to `inv_amount * 5`, and assign the results to `acct_imp`.
- Impute the missing values of `acct_amount` in `banking_fullid` with the newly created `acct_imp` using `.fillna()`.

```python
# Drop missing values of cust_id
banking_fullid = banking.dropna(subset = ['cust_id'])

# Compute estimated acct_amount
acct_imp = banking_fullid['inv_amount'] * 5

# Impute missing acct_amount with corresponding acct_imp
banking_imputed = banking_fullid.fillna({'acct_amount':acct_imp})

# Print number of missing values
print(banking_imputed.isna().sum())
```

```
    cust_id             0
    acct_amount         0
    inv_amount          0
    account_opened      0
    last_transaction    0
    dtype: int64
```

*As you can see, no missing data is left. You can definitely bank on getting your analysis right.*
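The imputation in this exercise fills each gap with the estimate computed from the same row. This works because pandas aligns the fill values on the index. The sketch below demonstrates that alignment on made-up data, using `Series.fillna` on a single column; the sample customers and amounts are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one missing cust_id and one missing acct_amount
banking = pd.DataFrame({
    'cust_id': ['A1', None, 'C3'],
    'acct_amount': [1000.0, 500.0, np.nan],
    'inv_amount': [200.0, 100.0, 300.0],
})

# Drop rows without a customer id; the original index labels are kept
banking_fullid = banking.dropna(subset=['cust_id'])

# Row-wise estimate of the account amount
acct_imp = banking_fullid['inv_amount'] * 5

# fillna aligns on the index, so each gap takes the estimate from its own row
acct_filled = banking_fullid['acct_amount'].fillna(acct_imp)
print(acct_filled)
```

Existing values are left untouched; only the missing entry is replaced by its row's estimate.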