---
# Lambda School Data Science - Intro to Pandas
---
# Assignment 07 - Cleaning Data
---



## STOP! BEFORE GOING ANY FURTHER...


1. Click "File" at the top.
2. Then, "Save a Copy in Drive."
3. Change the file name to "FIRSTNAME_LASTNAME_assign7"  

Now you have a copy of this notebook in your Drive account. This is the copy you'll edit and submit. Be sure to do this for ***every*** assignment!

---


## Set up the data with `pandas`

### 1.1 - Import pandas

In [82]:
# Import pandas with the standard alias
import pandas as pd


### 1.2 - Import the data contained in the CSV file
You can find the data [here](https://raw.githubusercontent.com/axrd/datasets/master/gdpmessy2.csv).

We'll use this [resource](https://www.iban.com/country-codes) (a list of country names and codes) later to "clean" values correctly.

In [83]:
# Read in the data

df = pd.read_csv('https://raw.githubusercontent.com/axrd/datasets/master/gdpmessy2.csv')

### 1.3 - Quickly inspect the head

In [84]:
# Look at rows 5 through 10 of the DataFrame (df.head(10) would show the first ten rows)

df.iloc[5:11]

,Unnamed: 0,countrys,GDP,code
5,5,Angola,131.4,AGO
6,6,,0.18,AIA
7,7,Antigua and Barbuda,1.24,ATG
8,8,Argentina,536.2,ARG
9,9,Armenia,10.88,ARM
10,10,Aruba,2.52,ABW
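
As a hedged illustration of the difference between "the head" and a row slice, the sketch below uses a small stand-in frame (`demo` and its values are made up, not the assignment data). `head()` always starts at the first row, while `iloc` slices by position and excludes the end position.

```python
import pandas as pd

# Small stand-in frame (hypothetical values, not the assignment data)
demo = pd.DataFrame({"country": list("ABCDEFGHIJKL"), "GDP": range(12)})

first_five = demo.head()        # default n=5: rows at positions 0-4
first_ten = demo.head(10)       # rows at positions 0-9
rows_5_to_10 = demo.iloc[5:11]  # positions 5-10; the end position 11 is excluded

print(len(first_five), len(first_ten))
print(list(rows_5_to_10.index))
```
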


### 1.4 - Inspect the tail

In [85]:
# Look at the tail
df.tail()


,Unnamed: 0,countrys,GDP,code
217,217,U.S. Virgin Islands,5.08,VIR
218,218,West Bank,6.64,WBG
219,219,Yemen,45.45,YEM
220,220,Zambia,25.61,ZMB
221,221,Zimbabwe,13.74,ZMB


### 1.5 - What is the shape of the DataFrame? 

Pay close attention to the column names; it looks like there may be an extra index column.

In [86]:
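# Look at the shape of df: df.info() reports the number of rows (entries),
# the number of columns, and each column's dtype and non-null count

df.info()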

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222 entries, 0 to 221
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  222 non-null    int64  
 1   countrys    218 non-null    object 
 2   GDP         219 non-null    float64
 3   code        217 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 7.1+ KB
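
For the shape alone, the `shape` attribute is the most direct route. The sketch below is a minimal example on a made-up frame (`demo` is hypothetical); note that `shape` is an attribute, so it takes no parentheses.

```python
import pandas as pd

# Hypothetical three-row frame with a couple of missing values
demo = pd.DataFrame({"country": ["A", "B", None], "GDP": [1.0, None, 3.0]})

n_rows, n_cols = demo.shape  # tuple of (rows, columns)
print(n_rows, n_cols)
print(demo.dtypes)           # one dtype per column
```
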


### 1.6 - Index change

It's easier to drop the extra column than to rebuild the index with `set_index`. Let's use the `df.drop()` method to remove that extra column.

In [87]:
# Drop the additional column

df = df.drop('Unnamed: 0', axis=1)

In [88]:
# Look at the head of the DataFrame to verify the extra column has been dropped

df.head()

,countrys,GDP,code
0,Afghanistan,21.71,AFG
1,Albania,13.4,ALB
2,Algeria,227.8,DZA
3,American Samoa,0.75,ASM
4,Andorra,4.8,AND
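
An alternative to dropping the column afterward is to tell `read_csv` that the first column is the index. The sketch below assumes a tiny inline CSV (made-up rows in `raw`) that was saved with its index, which is how an `Unnamed: 0` column typically appears.

```python
import io
import pandas as pd

# Tiny inline CSV mimicking a file saved with its index (hypothetical rows)
raw = io.StringIO(
    ",countrys,GDP,code\n"
    "0,Afghanistan,21.71,AFG\n"
    "1,Albania,13.4,ALB\n"
)

with_extra = pd.read_csv(raw)                   # the blank header becomes 'Unnamed: 0'
raw.seek(0)                                     # rewind the buffer before reading again
without_extra = pd.read_csv(raw, index_col=0)   # first column is used as the index

print(list(with_extra.columns))
print(list(without_extra.columns))
```
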


### 1.7 - Correct column names

Remember from lecture that we changed the column names to correct misspellings and formatting? We'll do that here. You can follow the method from lecture by assigning to the `.columns` attribute, or use the `.rename()` method.

In [89]:
# Rename the misspelled 'countrys' column to 'country'; GDP and code keep their names

df = df.rename(columns={'countrys': 'country'})
df


,country,GDP,code
0,Afghanistan,21.71,AFG
1,Albania,13.40,ALB
2,Algeria,227.80,DZA
3,American Samoa,0.75,ASM
4,Andorra,4.80,AND
...,...,...,...
217,U.S. Virgin Islands,5.08,VIR
218,West Bank,6.64,WBG
219,Yemen,45.45,YEM
220,Zambia,25.61,ZMB
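
The two renaming approaches can be compared side by side. This is a minimal sketch on a one-row stand-in frame (`demo` is hypothetical): `rename` changes only the labels named in the mapping, while assigning to `.columns` replaces every label and so needs the full list in the right order.

```python
import pandas as pd

demo = pd.DataFrame({"countrys": ["Afghanistan"], "GDP": [21.71], "code": ["AFG"]})

# Option 1: rename only the columns listed in the mapping
renamed = demo.rename(columns={"countrys": "country"})

# Option 2: replace all labels at once via the .columns attribute
relabeled = demo.copy()
relabeled.columns = ["country", "GDP", "code"]

print(list(renamed.columns))
print(list(relabeled.columns))
```
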


### 1.8 - Missing values

Missing values are something that you will deal with often when cleaning data. Let's find all of the missing values first, to see how many there are. Remember to use the method which 'sums' the null values.

In [92]:
# Find the missing values (we'll subset in the next cell).
# df.info() also reports non-null counts, from which the null counts could be
# worked out, but df.isnull().sum() gives them directly.
df.isnull().sum()



country    4
GDP        3
code       5
dtype: int64

In [94]:
# Subset the df to look at the null values
df[df.isnull().any(axis=1)]


,country,GDP,code
6,,0.18,AIA
14,"Bahamas, The",8.65,
16,Bangladesh,,BGD
20,,1.67,BLZ
55,Denmark,347.2,
70,France,,FRA
90,Hungary,129.7,
96,Ireland,,IRL
100,Jamaica,13.92,
106,,0.16,KIR
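
To make the two steps above concrete, here is a hedged sketch on a three-row stand-in frame (`demo` is hypothetical). `isnull()` returns a Boolean frame, `sum()` counts the `True` values per column, and `any(axis=1)` marks rows containing at least one missing value.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "country": ["Anguilla", np.nan, "Bangladesh"],
    "GDP": [0.18, 1.67, np.nan],
    "code": ["AIA", "BLZ", "BGD"],
})

null_mask = demo.isnull()                     # True where a value is missing
per_column = null_mask.sum()                  # True counts as 1: nulls per column
rows_with_null = demo[null_mask.any(axis=1)]  # rows with at least one null

print(per_column)
print(rows_with_null)
```
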


### 1.9 - Cleaning up the data: Country codes

Similar to lecture, we can see that we still have missing country codes. Use this [link](https://www.iban.com/country-codes) to find the codes and country names that are missing. Remember to use the `df.at` accessor to replace these codes.

In [95]:
# Fix the country codes first
# Use 'at' to get or set a single value in a DataFrame or Series.
print(df[df.isnull().any(axis=1)])

# Chained indexing such as df['country'][6] = 'Anguilla' may assign to a
# temporary copy, so set each value with df.at (or df.loc) instead.

df.at[14, 'code'] = 'BHS'
df.at[55, 'code'] = 'DNK'
df.at[90, 'code'] = 'HUN'
df.at[100, 'code'] = 'JAM'
df.at[200, 'code'] = 'TGO'


          country     GDP code
6             NaN    0.18  AIA
14   Bahamas, The    8.65  NaN
16     Bangladesh     NaN  BGD
20            NaN    1.67  BLZ
55        Denmark  347.20  NaN
70         France     NaN  FRA
90        Hungary  129.70  NaN
96        Ireland     NaN  IRL
100       Jamaica   13.92  NaN
106           NaN    0.16  KIR
200          Togo    4.84  NaN
206           NaN    0.04  TUV
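
As a small illustration of single-value assignment, the sketch below uses a stand-in frame whose index labels mirror three of the rows above (`demo` itself is hypothetical). `.at` takes exactly one row label and one column label; `.loc` accepts the same pair and also supports slices and Boolean masks.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "country": ["Bahamas, The", "Denmark", "Hungary"],
        "code": ["BHS", np.nan, np.nan],
    },
    index=[14, 55, 90],
)

demo.at[55, "code"] = "DNK"   # .at: one row label, one column label
demo.loc[90, "code"] = "HUN"  # .loc works too, and also accepts slices or masks

print(demo)
```
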


In [98]:
# Check to make sure all the codes have been replaced.
# (remember to subset the df so it only shows rows with NaN)


# checking using isnull().any(axis = 1)
df[df.isnull().any(axis = 1)]

,country,GDP,code
6,,0.18,AIA
16,Bangladesh,,BGD
20,,1.67,BLZ
70,France,,FRA
96,Ireland,,IRL
106,,0.16,KIR
206,,0.04,TUV


In [99]:
# Next fix the missing country names

df.at[6, 'country'] = 'Anguilla'
df.at[20, 'country'] = 'Belize'
df.at[106, 'country'] = 'Kiribati'
df.at[206, 'country'] = 'Tuvalu'



In [100]:
# Check to make sure all the codes and country names have been replaced.
# (remember to subset the df so it only shows rows with NaN)
df[df.isnull().any(axis = 1)]


,country,GDP,code
16,Bangladesh,,BGD
70,France,,FRA
96,Ireland,,IRL


### 1.10 - Missing GDP values

Now, it looks like we only have three GDP values missing. Using the information below, let's fix up those values.

*   France has a GDP of 2902.0 billion
*   Ireland has a GDP of 245.80 billion
*   Bangladesh has a GDP of 186.60 billion

In [101]:
# Fix missing GDP values
df.at[16, 'GDP'] = 186.60
df.at[70, 'GDP'] = 2902.0
df.at[96, 'GDP'] = 245.80
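
Hard-coded index labels work here, but they break if the rows are ever reordered. A hedged alternative is to select rows by country name with a Boolean mask and `.loc`; the sketch below uses a stand-in frame (`demo` and `known_gdp` are hypothetical names) with the three GDP values listed above.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "country": ["Bangladesh", "France", "Ireland", "Zambia"],
    "GDP": [np.nan, np.nan, np.nan, 25.61],
})

# GDP values in billions, taken from the list above
known_gdp = {"France": 2902.0, "Ireland": 245.80, "Bangladesh": 186.60}

for name, value in known_gdp.items():
    # The mask selects rows whose country matches; 'GDP' selects the column
    demo.loc[demo["country"] == name, "GDP"] = value

print(demo)
```
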


### Verify the change persisted!

Run one final check to see how many missing values there are. If you still have more than zero, go back through the cells above to fill them in.

In [110]:
# Do a final check for missing values
print(df[df.isnull().any(axis=1)])
print(df.isnull().sum())

Empty DataFrame
Columns: [country, GDP, code]
Index: []
country    0
GDP        0
code       0
dtype: int64


### Solution 1.10

We'll do a test on the DataFrame to check that all the missing values have been filled in.

In [104]:
# DO NOT EDIT THIS CELL
# SOLUTION to 1.10

# Check that df has no missing values.
# notnull() flags each non-missing cell; df.all() alone tests truthiness and skips NaN.
assert df.notnull().all(axis=None), "You still have null values in your DataFrame."
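
A note on null checks in general: `DataFrame.all` tests whether values are truthy and skips NaN by default, so on its own it is not a missing-value test. The sketch below, on a made-up numeric frame (`demo` and its `population` column are hypothetical), contrasts it with a check built on `notnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with one missing GDP value
demo = pd.DataFrame({"GDP": [2902.0, np.nan], "population": [67.0, 5.0]})

truthy = demo.all(axis=None)              # NaN is skipped (skipna=True by default)
no_nulls = demo.notnull().all(axis=None)  # False as soon as any cell is missing

print(bool(truthy), bool(no_nulls))
```
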

### Submit your assignment notebook! (Make sure you've changed the name to FIRSTNAME_LASTNAME_assign7): 

1.  Click the Share button in the upper-right hand corner of the notebook.
2.  Get the shareable link.
3.  Set condition to: "Anyone with the link can comment."


---
