# Data Pre-processing
<p float="left">
<img src="https://upload.wikimedia.org/wikipedia/en/3/39/Tartu_%C3%9Clikool_logo.svg" width="100"/>
</p>

We have three files that describe some data of mobile phones.
- Mobile data 1.csv: This file contains
    - id - ID
    - battery_power - Total energy a battery can store in one time measured in mAh
    - blue - Has bluetooth or not
    - clock_speed - speed at which microprocessor executes instructions
    - dual_sim - Has dual sim support or not
    - fc - Front Camera mega pixels
    - four_g - Has 4G or not
    - int_memory - Internal Memory in Gigabytes
    - m_dep - Mobile Depth in cm
    - mobile_wt - Weight of mobile phone
    - n_cores - Number of cores of processor
    - pc - Primary Camera mega pixels
    - px_height - Pixel Resolution Height
    - px_width - Pixel Resolution Width
    - ram - Random Access Memory in Megabytes
    - sc_h - Screen Height of mobile in cm
    - sc_w - Screen Width of mobile in cm
    - talk_time - longest time that a single battery charge will last when you are
    - three_g - Has 3G or not
    - touch_screen - Has touch screen or not
    - wifi - Has wifi or not
    - price_range - This is the target variable with value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost).
- Mobile data 2.csv: This file contains
    - ID
    - Brand - the make of the phone, e.g., Apple, Samsung, Ericsson, etc.
    - Phone - the model of the phone, e.g. iPhone 12
    - Picture URL small - a hyperlink to an image of the phone
    - Body Dimensions - a compound value that is not uniform. It contains w x h x d in both mm and inches
    - Body Weight - the weight is given in both grams and lb, and sometimes for several configurations
    - Display Resolution - this column combines different values of w x h pixels, screen ratio and pixel density
- Price ranges.csv - a simple file that contains the range id and the min and max value for each range

### Our objective is to merge the data into one data set.
### But first, we have to pre-process some of the individual files
For mobile data 1.csv, we need to apply the following pre-processing steps (transformations)
- Remove duplicates
- Rename columns
- Handle missing data (this will be revisited in more detail when talking about data cleansing)
    - Just drop the rows
    - Fill with some neighbor value
    - Fill with some statistical value (mean or median)
- Handle outliers
- Standardize the data

For mobile data 2.csv, we need to apply the following pre-processing steps (transformations)

- Out of the Body Dimensions column, we need to extract three columns: Width, Height and Depth, all in mm. **This is a one-to-one transformation.**
- Out of the Body Weight column, we need to extract the weight in grams for each possible offering. For example, iPads that come with WiFi only have a different weight from those that come with additional 4G support. **This is a one-to-many transformation.**
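
The difference between the two kinds of transformation can be sketched on a toy frame. The sample strings, the regular expressions and the variable names below are assumptions made for illustration; they are not taken from the course files.

```python
import pandas as pd

# Toy rows; the strings are made-up examples of the compound format
toy = pd.DataFrame({
    'ID': [1, 2],
    'BodyDimensions': ['240 x 170 x 7.5 mm (9.45 x 6.69 x 0.30 in)',
                       '150 x 70 x 8 mm (5.91 x 2.76 x 0.31 in)'],
    'BodyWeight': ['331 g (Wi-Fi) / 341 g (3G/LTE) (11.68 oz)',
                   '160 g (5.64 oz)'],
})

# One to one: every input row yields exactly one output row with three new columns
dims = toy['BodyDimensions'].str.extract(r'([\d.]+) x ([\d.]+) x ([\d.]+) mm').astype(float)
dims.columns = ['Width', 'Height', 'Depth']
one_to_one = toy[['ID']].join(dims)

# One to many: a row that lists several weights yields one output row per weight
weights = toy['BodyWeight'].str.findall(r'([\d.]+) g')
one_to_many = toy[['ID']].assign(WeightInGrams=weights).explode('WeightInGrams')
one_to_many['WeightInGrams'] = one_to_many['WeightInGrams'].astype(float)
```

Here `one_to_one` keeps the two input rows, while `one_to_many` has three rows because the first phone lists two weights.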

In [None]:
%config Completer.use_jedi = False

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler

In [None]:
pd

## Pre-processing the first data set

### Profiling
First we do some generic profiling. We check the shape, the data types, the columns and a general statistical description of the data.

In [None]:
# read in simple csv file
data_1 = pd.read_csv('./data/input/mobile_data_1.csv')

In [None]:
# check amount of rows and columns 
data_1.shape

In [None]:
data_1.dtypes

In [None]:
data_1.describe()

In [None]:
# check how much memory the dataframe consumes
data_1.info()
# if a column contains the "object" data type, use
# data_1.info(memory_usage='deep')

In [None]:
data_1.head()

In [None]:
data_1.columns

In [None]:
# NB - notice when you overwrite the existing dataframe and when you create a new df
data_1 = data_1.rename(columns = {'blue' : 'bluetooth', 
                              'fc' : 'fc_megapixel',
                              'pc' : 'pc_megapixel',
                              'm_dep' : 'm_depth'})

In [None]:
data_1.head()

In [None]:
# modify display behaviour
pd.get_option('display.max_columns')

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
data_1.head()

In [None]:
data_1.sample(100)

### Handling duplicates

In [None]:
dupes = data_1.duplicated()

In [None]:
dupes.head()

In [None]:
data_1 = data_1.drop_duplicates()
data_1.shape

### Handling Missing Data

In [None]:
data_1.isnull().sum()

#### Replacing NaNs with 0
_data imputation_

In [None]:
data_1['fc_megapixel'] = data_1['fc_megapixel'].fillna(0)

data_1.isnull().sum()

In [None]:
data_1.sort_values(by=['fc_megapixel']).head()

#### Filling Forward or Backward
If we supply a method parameter to the fillna() method, we can fill forward or backward as we need. To fill forward, use the methods pad or ffill, and to fill backward, use bfill or backfill. Newer pandas releases deprecate this parameter in favour of calling ffill() or bfill() directly.

NB! Make sure this makes sense for your data.
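
A minimal sketch of the two fill directions on a made-up series (the values are illustrative only):

```python
import numpy as np
import pandas as pd

ram = pd.Series([512.0, np.nan, np.nan, 2048.0, np.nan])

# Forward fill: each gap takes the last valid value seen before it
forward = ram.ffill()

# Backward fill: each gap takes the next valid value after it;
# a trailing gap has no successor, so it stays missing
backward = ram.bfill()
```

Note that a backward fill cannot repair a gap at the very end of the column, and a forward fill cannot repair one at the very start, so it is worth re-checking `isnull().sum()` afterwards.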

In [None]:
data_1[data_1['ram'].isna()]

In [None]:
data_1.head()

In [None]:
len(data_1['ram'].unique())

In [None]:
# bfill() is the direct equivalent of fillna(method='backfill'), which newer pandas versions deprecate
data_1['ram'] = data_1['ram'].bfill()

len(data_1['ram'].unique())

In [None]:
data_1.head()

In [None]:
data_1.isnull().sum()

#### Replacing NaN with the median of the column

In [None]:
data_1['mobile_wt'].median()

In [None]:
data_1['mobile_wt'] = data_1['mobile_wt'].fillna(data_1['mobile_wt'].median())

In [None]:
data_1['mobile_wt'].head()

In [None]:
data_1 = data_1.dropna()

In [None]:
data_1.isnull().sum()

In [None]:
data_1.shape

### Handling outliers
Another type of profiling is to check for legitimate values and to handle outliers. This applies to numerical columns only.

In [None]:
numerical_data = data_1.drop(['id','phone_id', 'bluetooth', 'dual_sim', 'four_g', 'three_g', 
                          'touch_screen', 'wifi', 'price_range'], axis=1)
numerical_data.head()

In [None]:
categorical_data = data_1[['id', 'phone_id','bluetooth', 'dual_sim', 'four_g', 'three_g', 
                         'touch_screen', 'wifi', 'price_range']]
categorical_data.head()

In [None]:
# view a single column's outliers
sns.boxplot(data=numerical_data['ram'],
            orient = 'v')

In [None]:
# view all columns
bp = sns.boxplot(data = numerical_data)

bp.set_xticklabels(bp.get_xticklabels(), rotation=90)

To better visualize, you may want to standardize the values of the columns.
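
To make clear what standardizing does, here is a hand-computed z-score on a toy frame (made-up values). With its default settings, StandardScaler performs the same computation: it subtracts the column mean and divides by the population standard deviation (ddof=0).

```python
import pandas as pd

toy = pd.DataFrame({'ram': [512.0, 1024.0, 2048.0, 4096.0],
                    'mobile_wt': [120.0, 150.0, 180.0, 210.0]})

# z-score per column: subtract the mean, divide by the population standard deviation
z = (toy - toy.mean()) / toy.std(ddof=0)
```

After this step every column has mean 0 and standard deviation 1, so columns with very different units fit on one boxplot axis.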

In [None]:

scaler = StandardScaler()

scaled_array = scaler.fit_transform(numerical_data)

In [None]:
scaled_data = pd.DataFrame(scaled_array, columns = numerical_data.columns)

scaled_data.head()

In [None]:
scaled_data.describe()

In [None]:
bp = sns.boxplot(data = scaled_data)

bp.set_xticklabels(bp.get_xticklabels(), rotation=90)

In [None]:
# we can use Interquartile range (IQR) to distinguish outliers 
Q1 = numerical_data.quantile(0.25)
Q3 = numerical_data.quantile(0.75)

IQR = Q3 - Q1

print(IQR)

In [None]:
outliers_removed_data = numerical_data[~ ((numerical_data < (Q1 - 1.5 * IQR)) \
                                     | (numerical_data > (Q3 + 1.5 * IQR))).any(axis=1)]

outliers_removed_data.shape
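
The IQR filter above can be checked by hand on a toy frame. The values below are made up so that exactly one row contains an outlier.

```python
import pandas as pd

toy = pd.DataFrame({'ram': [1000, 1100, 1200, 1300, 9000],
                    'mobile_wt': [140, 150, 160, 170, 180]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# True where a cell lies outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] of its own column
is_outlier = (toy < q1 - 1.5 * iqr) | (toy > q3 + 1.5 * iqr)

# Keep only the rows that have no outlying cell in any column
kept = toy[~is_outlier.any(axis=1)]
```

Because of `any(axis=1)`, a single outlying cell removes the whole row, including its perfectly normal values in the other columns.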

In [None]:
bp = sns.boxplot(data = outliers_removed_data)

bp.set_xticklabels(bp.get_xticklabels(), rotation=90)

### Put things together

In [None]:
final_data_1 = outliers_removed_data.join(categorical_data, how='inner')
final_data_1.head()

### Write this part of the data to disk

In [None]:
final_data_1.to_csv('./data/output/mobile_data_1_cleaned.csv', index = False)

## Pre-processing the second data set

In [None]:
data_2 = pd.read_csv('./data/input/mobile_data_2.csv')

In [None]:
data_2.head(30)

We can observe that some of the columns hold compound information, e.g. Body Dimensions and Body Weight. Some records also contain meaningless data, and some columns, like the picture URL, are irrelevant to the analysis.
So, we will do the following.
- Drop irrelevant columns, e.g. Picture URL small
- Drop records with meaningless data. Meaningful values in columns like Body Dimensions, Body Weight and Display Resolution are fairly long strings, so we can drop records in which such a value is too short. The filter below keeps records whose Body Weight is longer than 10 characters and whose Display Resolution is longer than 20 characters.
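
As a small sketch of the length-based filter, consider the toy frame below. The placeholder values '-' and 'nan' are assumptions about what meaningless entries might look like.

```python
import pandas as pd

toy = pd.DataFrame({
    'Phone': ['A', 'B', 'C'],
    'BodyWeight': ['160 g (5.64 oz)', '-', 'nan'],
    'DisplayResolution': ['1080 x 2340 pixels, 19.5:9 ratio',
                          '1080 x 2340 pixels, 19.5:9 ratio',
                          '-'],
})

# A record is kept only if both strings are long enough to carry real information
long_enough = (toy['BodyWeight'].str.len() > 10) & (toy['DisplayResolution'].str.len() > 20)

# copy() makes the result an independent frame that is safe to modify later
meaningful = toy[long_enough].copy()
```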


In [None]:
data_2 = data_2.drop('Picture URL small', axis =1)
data_2.columns

In [None]:
data_2 = data_2.rename (columns = {'Body Dimensions':'BodyDimensions', 'Body Weight':'BodyWeight', 'Display Resolution':'DisplayResolution'})
data_2.dtypes

In [None]:
data_2.info()
#data_2.info(memory_usage='deep')

We can observe that pandas does not infer a specific type for the non-ID columns; it reports them with the generic object dtype. So, let's enforce the string type.

In [None]:
data_2['Brand']= data_2['Brand'].astype('str')
data_2['Phone']= data_2['Phone'].astype('str')
data_2['BodyDimensions']= data_2['BodyDimensions'].astype('str')
data_2['BodyWeight']= data_2['BodyWeight'].astype('str')
data_2['DisplayResolution']= data_2['DisplayResolution'].astype('str')

In [None]:
meaningful_data_2 = data_2[(data_2['BodyWeight'].str.len() > 10) & (data_2['DisplayResolution'].str.len() > 20)]

In [None]:
data_2.shape

In [None]:
meaningful_data_2.shape 

We need to extract the weight in grams as a new column. We can achieve that by splitting the string that describes the weight at the **g** character and keeping the part before the first occurrence.
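
Splitting at 'g' works as long as the first 'g' in the string is the unit. As a hedged alternative, a regular expression can capture the number in front of the unit directly. The sample strings and the pattern below are illustrative assumptions.

```python
import pandas as pd

body_weight = pd.Series(['160 g (5.64 oz)',
                         '331 g (Wi-Fi) / 341 g (3G/LTE) (11.68 oz)'])

# Approach from the text: keep everything before the first 'g'
by_split = body_weight.str.split('g').str[0].str.strip()

# Alternative: capture the first number that is followed by the unit 'g'
by_regex = body_weight.str.extract(r'([\d.]+)\s*g\b', expand=False)
```

Both variants return strings, so a conversion such as `pd.to_numeric` is still needed before doing arithmetic on the weights.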

In [None]:
WeightInGrams = meaningful_data_2.apply(lambda row: row['BodyWeight'].split('g')[0].strip(), axis=1)

In [None]:
WeightInGrams

In [None]:
# make a copy first to avoid a SettingWithCopyWarning and unwanted changes in previous dataframes
meaningful_data_3 = meaningful_data_2.copy()
meaningful_data_3['WeightInGrams'] = WeightInGrams

In [None]:
meaningful_data_3.shape

In [None]:
meaningful_data_3.sample(100)

We can notice in the original data that the BodyWeight column can actually hold multiple values. For example, we can find a value like _331 g (Wi-Fi) / 341 g (3G/LTE) (11.68 oz);_. This means that there are different weights for different configurations. Our objective is to create a new row for each distinct weight and to add another column that holds the configuration.

One way is to first get those rows that have multiple weight values.

In [None]:
multiple_weight_phones = meaningful_data_3[(meaningful_data_3['BodyWeight'].str.find('/') > -1 ) | (meaningful_data_3['BodyWeight'].str.find(',') > -1 )]

In [None]:
multiple_weight_phones.shape

In [None]:
multiple_weight_phones

Let's first unify the splitting character. We make sure that ',' is transformed to '/'. However, '/' also occurs inside some configurations, e.g. (3G/LTE), which would complicate the splitting. The separators between weights are followed by a space, so we make a second replacement that turns '/ ' into '//'.
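
The two replacements can be traced on plain strings. The first string is the example quoted earlier; the second, comma-separated one is made up for illustration.

```python
slash_separated = '331 g (Wi-Fi) / 341 g (3G/LTE) (11.68 oz)'
comma_separated = '300 g (Wi-Fi), 310 g (LTE)'  # hypothetical value

def unify(text):
    # ',' becomes '/', then every '/ ' (separator followed by a space) becomes '//'
    return text.replace(',', '/').replace('/ ', '//')

# The '/' inside '(3G/LTE)' is not followed by a space, so it survives the split
slash_parts = [part.strip() for part in unify(slash_separated).split('//')]
comma_parts = [part.strip() for part in unify(comma_separated).split('//')]
```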

In [None]:
unifiedSeparator = multiple_weight_phones.apply (lambda row: row['BodyWeight'].replace(',','/').replace('/ ','//'), axis=1)

In [None]:
unifiedSeparator.head()

In [None]:
multiple_weight_phones = multiple_weight_phones.drop(['BodyWeight'], axis=1)
multiple_weight_phones.shape

In [None]:
multiple_weight_phones['BodyWeight'] =unifiedSeparator
multiple_weight_phones.shape

In [None]:
multiple_weight_phones['BodyWeight']

In [None]:
multiple_weight_2 = multiple_weight_phones.apply(lambda row: row['BodyWeight'].split('//'), axis=1).explode()
multiple_weight_2.head()

In [None]:
multiple_weights_3 = multiple_weight_phones.join(pd.DataFrame(multiple_weight_2,columns=['Config']))
multiple_weights_3.head()

In [None]:
multiple_weights_3.columns

In [None]:
multiple_weights_3 = multiple_weights_3.drop(['Brand', 'Phone', 'BodyDimensions', 'DisplayResolution',
       'WeightInGrams', 'BodyWeight'], axis=1)
multiple_weights_3.head()

Now, we can split the Config column into weight and config.
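
As a sketch of this step, both parts can also be captured in a single pass with named groups. The pattern is an assumption about the string layout (a number, the unit g, then a label in parentheses) and is not the method used in the cells that follow.

```python
import pandas as pd

config_raw = pd.Series(['331 g (Wi-Fi) ', '341 g (3G/LTE) (11.68 oz)'])

# Named groups become column names: the number before 'g',
# then the first parenthesised label after it
parsed = config_raw.str.extract(r'(?P<Weight>[\d.]+)\s*g\s*\((?P<Config>[^)]*)\)')
```

Rows that do not match the pattern get NaN in both columns, which makes malformed entries easy to spot.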

In [None]:
weight = multiple_weights_3.apply(lambda row: row['Config'].split('g')[0].strip(), axis=1)
weight.head()

In [None]:
config = multiple_weights_3.apply(lambda row: row['Config'].split('g')[1].strip().split(')')[0][1:], axis=1)
config.head()

In [None]:
multiple_weights_3= multiple_weights_3.drop(['Config'], axis = 1)
multiple_weights_3.columns

In [None]:
multiple_weights_3['Config'] = config

In [None]:
multiple_weights_3['Weight'] = weight 

In [None]:
multiple_weights_3.head()

In [None]:
meaningful_data_4 = meaningful_data_3.join(multiple_weights_3, how='left',rsuffix='_multiple')

In [None]:
meaningful_data_4.sample(50)

In [None]:
meaningful_data_4 = meaningful_data_4.drop(['ID_multiple'], axis=1)

In [None]:
meaningful_data_4.sample(50)

In [None]:
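# assign the result back; calling fillna(inplace=True) on a selected column may not update the dataframe
meaningful_data_4['Weight'] = meaningful_data_4['Weight'].fillna(meaningful_data_4['WeightInGrams'])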

In [None]:
meaningful_data_4.sample(50)

In [None]:
meaningful_data_4 = meaningful_data_4.drop(['WeightInGrams'], axis=1)

In [None]:
meaningful_data_4.head(20)

Checking the dataframes held in memory

In [None]:
%whos DataFrame

In [None]:
# Find the memory footprint
df_list = []    
for var in dir():
    if isinstance(locals()[var], pd.core.frame.DataFrame) and var[0] != '_':
        df_list.append(var)    

In [None]:
memory_cons = 0
for d in df_list:
    memory_cons += locals()[d].memory_usage(deep=True).sum()

print(f'Total memory consumed by dataframes: {round(memory_cons/1024/1024,1)} MB')

#### Assignment
 

In [None]:
# Fill the empty "Config" column in the meaningful_data_4 dataframe with value "Standard"

# copy first so that meaningful_data_4 itself stays unchanged
meaningful_data_5 = meaningful_data_4.copy()
meaningful_data_5['Config'] = meaningful_data_5['Config'].fillna('Standard')
meaningful_data_5.head(50)

In [None]:
# Transform the body dimensions and get separate height, width, and depth dimensions in mm

dimensions = meaningful_data_5.apply(lambda row: row['BodyDimensions'].split('mm')[0].strip(), axis=1)
dimensions.head()

In [None]:
height_dim = dimensions.apply(lambda row: row.split(' x ')[0].strip())
width_dim = dimensions.apply(lambda row: row.split(' x ')[1].strip())
depth_dim = dimensions.apply(lambda row: row.split(' x ')[2].strip())

In [None]:
height_dim.head()
#width_dim.head()
#depth_dim.head()

In [None]:
final_data_2 = (meaningful_data_5
                     .join(pd.DataFrame(height_dim, columns=['Height']))
                     .join(pd.DataFrame(width_dim, columns=['Width']))
                     .join(pd.DataFrame(depth_dim, columns=['Depth']))
                    )

In [None]:
final_data_2.head()

In [None]:
# Inner join the three data sets:
## preprocessed mobile 1 dataset
## preprocessed mobile 2 dataset
## price ranges


price_ranges = pd.read_csv('./data/input/price_ranges.csv', header=0, names=['price_range', 'Min', 'Max'])

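# join() matches the `on` column against the index of the other frame,
# so index price_ranges by its range id first
joined_data = (final_data_1
               .join(final_data_2, how='inner')
               .join(price_ranges.set_index('price_range'), how='inner', on='price_range')
              )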

joined_data.head()

In [None]:
joined_data.shape
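
Without an `on` argument, join() matches rows by index label rather than by a key column. The sketch below shows a key-based alternative with merge() on toy frames. It assumes that phone_id in the first data set refers to ID in the second, which the source files do not confirm; all values are made up.

```python
import pandas as pd

phones_1 = pd.DataFrame({'phone_id': [10, 11, 12],
                         'ram': [2048, 4096, 1024],
                         'price_range': [1, 3, 0]})
phones_2 = pd.DataFrame({'ID': [10, 10, 12],
                         'Config': ['Wi-Fi', '3G/LTE', 'Standard'],
                         'Weight': [331, 341, 160]})
ranges = pd.DataFrame({'price_range': [0, 1, 2, 3],
                       'Min': [0, 100, 300, 600],
                       'Max': [99, 299, 599, 2000]})

# Inner merges keep only the phones that appear in both data sets
merged = (phones_1
          .merge(phones_2, left_on='phone_id', right_on='ID', how='inner')
          .merge(ranges, on='price_range', how='inner'))
```

Phone 10 appears twice in the result, once per configuration, and phone 11 is dropped because it has no match in the second frame.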

In [None]:
# Save the final data frame in a file 'ready_for_analysis.csv'

joined_data.to_csv('./data/output/ready_for_analysis.csv')

#### Check also:

* <a href="https://pandas.pydata.org/docs/reference/index.html">API reference</a>
* <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Official cheat sheet</a>