**Exercise 1 - Data cleaning**

---

This first exercise covers data preparation and cleaning. We will work with tabular data and apply some basic preprocessing and data cleaning operations.

**1.1 Data preparation \[5\]**

---

In [1]:
# import libraries used during this exercise
import pandas as pd
import numpy as np
import random

In this exercise we will work with a meteorite landings dataset provided by NASA (available [here](https://www.kaggle.com/nasa/meteorite-landings)). This dataset contains information about each landing's geographic position, mass, type, class, year, etc.

In [17]:
# read the dataset
df = pd.read_csv('data/meteorite-landings.csv')
# display a preview of the DataFrame
df


Unnamed: 0,name,id,nametype,recclass,mass,fall,year,reclat,reclong,GeoLocation
0,Aachen,1,Valid,L5,21.0,Fell,1880.0,50.77500,6.08333,"(50.775000, 6.083330)"
1,Aarhus,2,Valid,H6,720.0,Fell,1951.0,56.18333,10.23333,"(56.183330, 10.233330)"
2,Abee,6,Valid,EH4,107000.0,Fell,1952.0,54.21667,-113.00000,"(54.216670, -113.000000)"
3,Acapulco,10,Valid,Acapulcoite,1914.0,Fell,1976.0,16.88333,-99.90000,"(16.883330, -99.900000)"
4,Achiras,370,Valid,L6,780.0,Fell,1902.0,-33.16667,-64.95000,"(-33.166670, -64.950000)"
...,...,...,...,...,...,...,...,...,...,...
45711,Zillah 002,31356,Valid,Eucrite,172.0,Found,1990.0,29.03700,17.01850,"(29.037000, 17.018500)"
45712,Zinder,30409,Valid,"Pallasite, ungrouped",46.0,Found,1999.0,13.78333,8.96667,"(13.783330, 8.966670)"
45713,Zlin,30410,Valid,H4,3.3,Found,1939.0,49.25000,17.66667,"(49.250000, 17.666670)"
45714,Zubkovsky,31357,Valid,L6,2167.0,Found,2003.0,49.78917,41.50460,"(49.789170, 41.504600)"


Given the loaded DataFrame, your first task is to prepare the data. For now you should **not** remove any rows:
- Convert all data to numerical values
- Remove irrelevant columns (if any)
- Remove redundant data (if any)
- Remove nonsensical data (if any)

The file *data_cleaning.py* contains some suggested functions to be implemented. Feel free to implement them differently.
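
As a rough illustration of the first two preparation steps (not the reference solution), the sketch below encodes categorical columns as integers and drops a column on a small made-up DataFrame. The helper names `encode_categorical_sketch` and `drop_columns_sketch` and the toy values are assumptions for illustration only; the functions suggested in *data_cleaning.py* may have different names and signatures.

```python
import pandas as pd

# Toy stand-in for the meteorite data (values are illustrative only).
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "recclass": ["L5", "H6", "L5", "EH4"],
    "fall": ["Fell", "Fell", "Found", "Found"],
    "mass": [21.0, 720.0, None, 3.3],
})

def encode_categorical_sketch(df, column):
    """Return a copy where `column` holds integer codes (order of first appearance)."""
    out = df.copy()
    codes, _uniques = pd.factorize(out[column])
    out[column] = codes
    return out

def drop_columns_sketch(df, columns):
    """Return a copy without the listed columns."""
    return df.drop(columns=columns)

prepared = encode_categorical_sketch(toy, "recclass")
prepared = encode_categorical_sketch(prepared, "fall")
prepared = drop_columns_sketch(prepared, ["name"])

print(prepared)
print(prepared.dtypes)
```

Both helpers return a new DataFrame instead of modifying the input, and no rows are removed, so missing values such as the `NaN` in `mass` are still present after this step.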


In [4]:
# import the data_cleaning.py where you will need to implement the core functions
from mlrcv.data_cleaning import *

def data_preparation(df):
    #########################################################################
    # Implement your functions in data_cleaning.py then call them here
    ########################### Your data preparation #######################

    # dictionary of random values, used only to check remap_values
    remap_dict = {name: random.random() for name in df['name']}
    # remap_values(df, 'name', remap_dict)  # uncomment to use this function (it takes some time)

    # encode the categorical columns as numbers
    categorical_to_num(df, 'recclass')
    categorical_to_num(df, 'nametype')

    # drop the columns listed in col
    col = ['name', 'GeoLocation', 'fall']
    drop_column(df, col)

    # convert the remaining data to a numeric datatype where possible
    df = df.apply(pd.to_numeric, errors='ignore')

    #########################################################################
    return df

df = data_preparation(df)
df

Unnamed: 0,id,nametype,recclass,mass,year,reclat,reclong
0,1,0,0,21.0,1880.0,50.77500,6.08333
1,2,0,1,720.0,1951.0,56.18333,10.23333
2,6,0,2,107000.0,1952.0,54.21667,-113.00000
3,10,0,3,1914.0,1976.0,16.88333,-99.90000
4,370,0,4,780.0,1902.0,-33.16667,-64.95000
...,...,...,...,...,...,...,...
45711,31356,0,43,172.0,1990.0,29.03700,17.01850
45712,30409,0,350,46.0,1999.0,13.78333,8.96667
45713,30410,0,10,3.3,1939.0,49.25000,17.66667
45714,31357,0,4,2167.0,2003.0,49.78917,41.50460


After your data preparation you should notice several differences between the data before and after, for example fewer columns and different values. In addition, all column data types should now be numerical (no object type).
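
One possible way to confirm that no non-numerical columns remain is sketched below. The helper `non_numeric_columns` and the two small DataFrames are hypothetical and only serve to illustrate the check; `is_numeric_dtype` comes from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def non_numeric_columns(df):
    """List the columns whose dtype is not numeric."""
    return [col for col in df.columns if not is_numeric_dtype(df[col])]

# Made-up data: one frame with a string column, one fully numeric.
before = pd.DataFrame({"recclass": ["L5", "H6"], "mass": [21.0, 720.0]})
after = pd.DataFrame({"recclass": [0, 1], "mass": [21.0, 720.0]})

print(non_numeric_columns(before))  # the string column is reported
print(non_numeric_columns(after))   # empty list when everything is numeric
```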


**1.2 Data cleaning \[5\]**

---

While preparing your data you should have noticed some undefined values (or *NaN*) in some fields. Until now they were simply ignored; now we need to handle them. In this second task you should deal with those *NaN* values. At this point you are, of course, allowed to remove rows from the DataFrame:
- Implement your *NaN-handling* functions in *data_cleaning.py*
- Clean your data (no more *NaNs*)

Again, some suggested functions are already predefined in *data_cleaning.py*, but feel free to change them.
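
Two common strategies for handling *NaN* values are filling them with a per-class mean and dropping the rows that still contain them. The sketch below shows both on a made-up DataFrame. The helper names `fill_nan_with_group_mean_sketch` and `drop_nan_rows_sketch` are assumptions and are not the functions predefined in *data_cleaning.py*.

```python
import numpy as np
import pandas as pd

# Toy data: class 2 has no known mass at all, so its NaN cannot be filled.
toy = pd.DataFrame({
    "recclass": [0, 0, 1, 1, 2],
    "mass": [10.0, np.nan, 4.0, 6.0, np.nan],
    "reclat": [50.0, 56.0, np.nan, 16.0, -33.0],
})

def fill_nan_with_group_mean_sketch(df, value_col, group_col):
    """Fill NaNs in `value_col` with the mean of the row's `group_col` group."""
    out = df.copy()
    group_means = out.groupby(group_col)[value_col].transform("mean")
    out[value_col] = out[value_col].fillna(group_means)
    return out

def drop_nan_rows_sketch(df, columns):
    """Drop the rows that have a NaN in any of the given columns."""
    return df.dropna(subset=columns)

cleaned = fill_nan_with_group_mean_sketch(toy, "mass", "recclass")
cleaned = drop_nan_rows_sketch(cleaned, ["mass", "reclat"])

print(cleaned)
print(cleaned.isnull().values.any())
```

Filling first and dropping second keeps as many rows as possible: a row is removed only when its class has no known value to average, or when a column without a fill rule is missing.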

In [14]:
def data_cleaning(df):
    #########################################################################
    # Implement your functions in data_cleaning.py then call them here
    ########################### Your data cleaning ##########################
   
    df = replace_nan_with_mean_class(df,'mass','mass')
    df = remove_nan_rows(df,'reclat')
    df = remove_row_within_range(df,'year',2000.0,2010.0)

    #########################################################################

    return df


df = data_cleaning(df)

df.isnull().values.any()


False

In [16]:
df.dtypes


id            int64
nametype      int64
recclass      int64
mass        float64
year        float64
reclat      float64
reclong     float64
dtype: object

If your implementations work, the output of `df.isnull().values.any()` above should be **False** (this call checks whether the DataFrame contains any NaN/null value).
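
If the check returns **True** instead, counting the missing values per column can help locate what is left to clean. This is a minimal sketch on made-up data, not part of the required solution:

```python
import numpy as np
import pandas as pd

# Made-up data with missing values in both columns.
toy = pd.DataFrame({
    "mass": [1.0, np.nan, 3.0],
    "year": [np.nan, np.nan, 2000.0],
})

# Number of NaN/null entries in each column.
nan_counts = toy.isnull().sum()
print(nan_counts)

# Single boolean: does the DataFrame contain any NaN/null at all?
print(toy.isnull().values.any())
```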

**Assignment Submission**

---

You should zip and submit the ```ex1_data_clean.ipynb``` file together with all the ```.py``` files inside the ```mlrcv/``` directory.

You can automatically generate the submission file using the provided ```zip_submission.sh``` script by running:

```
bash zip_submission.sh
```

This will zip the necessary files for your submission and generate the ```ex1_mlrcv_submission.zip``` file to be submitted via ecampus.