**Exercise 1 - Data cleaning**

---

This first exercise will be about data preparation and cleaning. In this exercise we will work with tabular data, work with some basic preprocessing and data cleaning operations.

**1.1 Data preparation \[5\]**

---

In [26]:
# import libraries used during this exercise
import pandas as pd
import numpy as np
import data_cleaning as func


In this exercise we will work with a meteorite landings data provided by NASA (available [here](https://www.kaggle.com/nasa/meteorite-landings)). This dataset contains information about the landing geodesic position, mass, type, class, year, etc.

In [27]:
# Dataset read
df = pd.read_csv('meteorite-landings.csv')
df

Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,1880.0,50.77500,6.08333,"(50.775000, 6.083330)"
1,Aarhus,2,Valid,H6,720.0,Fell,1951.0,56.18333,10.23333,"(56.183330, 10.233330)"
2,Abee,6,Valid,EH4,107000.0,Fell,1952.0,54.21667,-113.00000,"(54.216670, -113.000000)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976.0,16.88333,-99.90000,"(16.883330, -99.900000)"
4,Achiras,370,Valid,L6,780.0,Fell,1902.0,-33.16667,-64.95000,"(-33.166670, -64.950000)"
...,...,...,...,...,...,...,...,...,...,...
45711,Zillah 002,31356,Valid,Eucrite,172.0,Found,1990.0,29.03700,17.01850,"(29.037000, 17.018500)"
45712,Zinder,30409,Valid,"Pallasite, ungrouped",46.0,Found,1999.0,13.78333,8.96667,"(13.783330, 8.966670)"
45713,Zlin,30410,Valid,H4,3.3,Found,1939.0,49.25000,17.66667,"(49.250000, 17.666670)"
45714,Zubkovsky,31357,Valid,L6,2167.0,Found,2003.0,49.78917,41.50460,"(49.789170, 41.504600)"


Given this loaded dataframe, your first task will be to prepare this data, and for now you should not remove rows from your data:
- Convert all data to numerical
- Remove not relevant columns (if any)
- Remove redundant data (if any)
- Remove non-sense data (if any)

In the file *data_cleaning.py* there are some suggested functions to be implemented, feel free to implement the functions differently. (**Note:** for now ignore *NaN* values.)


In [28]:
# import the data_cleaning.py where you will need to implement the core functions
from data_cleaning import *

def data_preparation(df):
    #########################################################################
    # Implement your functions in data_cleaning.py then call them here
    ########################### Your data preparation #######################
    
    #dropping columns
    df = drop_column(df, 'name')
    df = drop_column(df , 'GeoLocation')

    #correcting ranges
    df = remove_row_within_range(df, 'year' , 860 ,2016)
    df = remove_row_within_range(df, 'reclat' , -90 ,90)
    df = remove_row_within_range(df, 'reclong' , -179.99999 ,180)
    
    #mapping from dictionaries
    #defining dictionaries
    dict_fall = {'Fall' : 1 ,'Found': 2}
    dict_nametype ={'Valid' : 1, 'Relict' :2}
    #replacing from dictionaries
    df = remap_values(df,'fall',dict_fall)
    df = remap_values(df,'nametype',dict_nametype)

    #changing to numbers 
    df = categorical_to_num(df ,'recclass')
    
    
    #########################################################################
    return df

df = data_preparation(df)
df

Unnamed: 0,id,nametype,recclass,mass,fall,year,reclat,reclong
0,1,1.0,1,21.0,,1880.0,50.77500,6.08333
1,2,1.0,2,720.0,,1951.0,56.18333,10.23333
2,6,1.0,3,107000.0,,1952.0,54.21667,-113.00000
3,10,1.0,4,1914.0,,1976.0,16.88333,-99.90000
4,370,1.0,5,780.0,,1902.0,-33.16667,-64.95000
...,...,...,...,...,...,...,...,...
45711,31356,1.0,43,172.0,2.0,1990.0,29.03700,17.01850
45712,30409,1.0,348,46.0,2.0,1999.0,13.78333,8.96667
45713,30410,1.0,11,3.3,2.0,1939.0,49.25000,17.66667
45714,31357,1.0,5,2167.0,2.0,2003.0,49.78917,41.50460


After your data cleaning you should notice many differences between the data before and after your data preparation, for example, less columns, different values. Besides that, the columns data type should be all numerical now (no object type).


**1.2 Data cleaning \[5\]**

---

While preparing your data you should have noticed some undefined values (or *NaN*) in some fields. Before, they should only be ignored, now we need to handle this. In this second task you should deal with those *NaN* values. At this point, of course, you are allowed to remove rows from the DataFrame:
- Implement your *NaN-handling* functions on *data_cleaning.py*
- Clean your data (no more *NaNs*)

Again, some suggested functions are already predefined on *data_cleaning.py*, however feel free to change it.

In [29]:
def data_cleaning(df):
    #########################################################################
    # Implement your functions in data_cleaning.py then call them here
    ########################### Your data cleaning ##########################
    
    #replacing nan from mass
    df = replace_nan_with_mean_class(df,'mass','recclass')
    
    #removing unneccessary rows with nans
    df = remove_nan_rows(df , 'nametype')
    df = remove_nan_rows(df, 'mass')
    df = remove_nan_rows(df ,'fall')


    ###########    #df = remove_nan_rows(df ,'year')##############################################################
    
    return df
df = data_cleaning(df)
df.isnull().values.any()
#df = df.dropna(axis =1)

False

If your implementations have worked the output from the code above should be **False** (this call check if the data frame has any NaN/null value).