# Dealing with missing data

Real-world data is rarely in a state where you can apply machine learning algorithms to it directly. We first need to process it into a format that the algorithms can work with. The first problem we usually face in real-world datasets is missing values.


<img src="Images/Missing_Data.jpg" width="50%">

In [None]:
import pandas as pd
from io import StringIO

Let's create a small sample dataset.

In [None]:
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

In [None]:
df = pd.read_csv(StringIO(csv_data))

In [None]:
df
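
Before choosing a strategy, it can help to count how many values are missing in each column. The following is a minimal sketch; it rebuilds the same sample data under the illustrative name `sample` so that it runs on its own.

```python
import pandas as pd
from io import StringIO

# Same sample data as above, rebuilt so this cell is self-contained
sample = pd.read_csv(StringIO('''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''))

# isnull() flags each missing cell; sum() counts the flags per column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```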

## What is the simplest way?



The simplest way is to eliminate every row in the dataset that has a missing value.

In [None]:
df.dropna()

Why only rows? If you wish, you can instead eliminate every column that has a missing value.

In [None]:
df.dropna(axis=1)
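
`dropna` also accepts parameters that make the removal less aggressive. The sketch below is only an illustration on the same sample data, rebuilt here as `sample` (an illustrative name) so the cell is self-contained.

```python
import pandas as pd
from io import StringIO

sample = pd.read_csv(StringIO('''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''))

# Drop a row only if every value in it is missing
rows_all_nan = sample.dropna(how='all')

# Keep only rows that have at least 4 non-missing values
rows_thresh = sample.dropna(thresh=4)

# Drop a row only if the value in column C is missing
rows_subset = sample.dropna(subset=['C'])

print(len(rows_all_nan), len(rows_thresh), len(rows_subset))
```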


<img src="Images/Data_Loss.jpg" width="50%">

# Losing valuable data is a crime in data science!

You cannot afford to discard an entire row because of a few missing values. Hence, we need ways to fill in the missing values.

### Imputing missing values

In [None]:
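import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer replaces the Imputer class that was removed from sklearn.preprocessing
# Each NaN is replaced with the mean of its column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)
imputed_data = imr.transform(df.values)
imputed_data

#### The strategy can replace a missing value with the 'mean', 'median' or 'most_frequent' value

`SimpleImputer` always imputes along columns, which is what `axis=0` meant in the older `Imputer` class. The older class also accepted `axis=1` to impute along rows.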
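
Column means can also be filled in directly with pandas. This is a minimal sketch of the same idea, not a replacement for the imputer above; `sample` and `filled` are illustrative names.

```python
import pandas as pd
from io import StringIO

sample = pd.read_csv(StringIO('''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''))

# sample.mean() computes one mean per column, skipping NaN;
# fillna() then uses each column's mean to fill that column's gaps
filled = sample.fillna(sample.mean())
print(filled)
```

Column C is filled with the mean of 3.0 and 12.0, and column D with the mean of 4.0 and 8.0.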

### Interpolation

In [None]:
df.interpolate()
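
By default, `interpolate()` uses linear interpolation: a gap between two known values in a column is estimated from its neighbors. A small sketch on a standalone series (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# One missing value sitting between two known values
series = pd.Series([1.0, np.nan, 3.0])

# Linear interpolation places the gap halfway between its neighbors
interpolated = series.interpolate()
print(interpolated.tolist())
```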

# Handling categorical data

Another problem we often face with real-life datasets is categorical data. Most machine learning algorithms rely on mathematical calculations, so we cannot feed them categorical data as it is. Hence, during preprocessing, categorical data must be converted into numeric form.


<img src="Images/Categorical_Data.jpg" width="80%">

Let's assume that you receive the following data from a T-shirt manufacturer.

In [None]:
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

#### Mapping ordinal features

In [None]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df
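
If the integer codes need to be turned back into the original size labels later, the mapping can be reversed. A minimal sketch, where `inv_size_mapping` and `encoded_sizes` are illustrative names:

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# Swap keys and values to get the reverse lookup
inv_size_mapping = {v: k for k, v in size_mapping.items()}

# A column of already-encoded sizes, standing in for df['size']
encoded_sizes = pd.Series([1, 2, 3])
decoded_sizes = encoded_sizes.map(inv_size_mapping)
print(decoded_sizes.tolist())
```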

#### Encoding class labels

In [None]:
from sklearn.preprocessing import LabelEncoder

X = df.values

# Encode the class labels (column 3) as integers
class_le = LabelEncoder()
X[:, 3] = class_le.fit_transform(X[:, 3])
X
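
A fitted `LabelEncoder` remembers the original labels, so the integer codes can be converted back with `inverse_transform`. A small standalone sketch (`demo_le` and `labels` are illustrative names):

```python
from sklearn.preprocessing import LabelEncoder

labels = ['class1', 'class2', 'class1']

demo_le = LabelEncoder()
# Classes are sorted, then numbered from 0
encoded = demo_le.fit_transform(labels)

# Map the integer codes back to the original strings
decoded = demo_le.inverse_transform(encoded)
print(list(encoded), list(decoded))
```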

#### Performing one-hot encoding on nominal features

In [None]:
X = df[['color', 'size', 'price']].values
X

##### Apply label encoding to the color column

In [None]:
color_le = LabelEncoder()
X[:, 0] = color_le.fit_transform(X[:, 0])
X

In [None]:
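from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# The categorical_features argument was removed from OneHotEncoder;
# ColumnTransformer now selects which column to encode (column 0, the color)
ct = ColumnTransformer([('onehot', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
ct.fit_transform(X)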
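
For quick experiments, pandas offers `get_dummies` as another way to one-hot encode a nominal column. This is a sketch on a standalone copy of the T-shirt data (`tshirts` is an illustrative name, with the sizes already mapped to integers):

```python
import pandas as pd

tshirts = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': [1, 2, 3],
                        'price': [10.1, 13.5, 15.3]})

# Replace the color column with one 0/1 indicator column per color
dummies = pd.get_dummies(tshirts, columns=['color'], dtype=int)
print(dummies)
```

Each row has exactly one indicator set to 1, so no artificial ordering is imposed on the colors.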

# Extracting features from text

In many machine learning applications, such as sentiment analysis, text is used as an explanatory variable. The text must be converted into a feature vector that captures as much of its information as possible.

<img src="Images/Text_Data.png" width="80%">

## The bag-of-words representation

Let's assume that we are working on a document classification problem. The collection of all the documents is called a corpus.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
]


vectorizer = CountVectorizer()

In [None]:
print(vectorizer.fit_transform(corpus).toarray())
print(vectorizer.vocabulary_)
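
To see what the bag-of-words representation contains, the counts can be built by hand. The sketch below is a simplification that lowercases and splits on whitespace; `CountVectorizer` uses its own tokenizer, which also drops punctuation and single-character tokens. For this corpus the two approaches agree. The names `docs`, `vocabulary`, and `bow` are illustrative.

```python
from collections import Counter

docs = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game',
]

# Lowercase and split each document into word tokens
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is every distinct word in the corpus, in sorted order
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary word, each cell a word count
bow = []
for tokens in tokenized:
    counts = Counter(tokens)
    bow.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow)
```

Word order is discarded: each document is described only by how often each vocabulary word occurs in it.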