# Feature Engineering 

Real-world data rarely arrives in a clean, ready-to-use form. With this in mind, `feature engineering` is an important concept in machine learning: it means taking whatever information we have about the problem and encoding it as numbers that a model can use.

This includes encoding labels and other non-numeric values as vectors and assembling them into a feature matrix that can be used to train the model. Well-chosen features often improve the model's accuracy.

# Categorical Data

When data is not numeric but instead holds information stored as strings, such as names or labels, it is called categorical data. For example:

In [1]:
# categorical data
data = [
    {'price': 85000, 'rooms': 5, 'neighborhood':'South Bombay'},
    {'price': 65000, 'rooms': 3, 'neighborhood':'Borivoli'},
    {'price': 75000, 'rooms': 4, 'neighborhood':'Dadar'},
    {'price': 55000, 'rooms': 5, 'neighborhood':'Thane'},
]

data

[{'neighborhood': 'South Bombay', 'price': 85000, 'rooms': 5},
 {'neighborhood': 'Borivoli', 'price': 65000, 'rooms': 3},
 {'neighborhood': 'Dadar', 'price': 75000, 'rooms': 4},
 {'neighborhood': 'Thane', 'price': 55000, 'rooms': 5}]

In [2]:
# wrong way of encoding: a plain integer mapping
{'South Bombay': 1, 'Borivoli': 2, 'Dadar': 3};

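Such a mapping is misleading because most models treat numeric features as algebraic quantities, so it would imply, for example, that South Bombay < Borivoli < Dadar. A more effective and proven technique is one-hot encoding, which creates an extra column for each category and marks its presence or absence with 1 or 0. When the data is a list of dictionaries, Scikit-learn's `DictVectorizer` does this.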

In [3]:
# importing DictVectorizer from feature_extraction
from sklearn.feature_extraction import DictVectorizer

# creating a DictVectorizer object
vec = DictVectorizer(sparse=False, dtype=int)

# fitting the vectorizer to the data and transforming it
vec.fit_transform(data)

array([[    0,     0,     1,     0, 85000,     5],
       [    1,     0,     0,     0, 65000,     3],
       [    0,     1,     0,     0, 75000,     4],
       [    0,     0,     0,     1, 55000,     5]], dtype=int32)

> Notice that this encoding has expanded the neighborhood variable into four separate columns, one per neighborhood, as the feature names show:

In [5]:
# feature names for the columns of the encoded matrix
vec.feature_names_

['neighborhood=Borivoli',
 'neighborhood=Dadar',
 'neighborhood=South Bombay',
 'neighborhood=Thane',
 'price',
 'rooms']
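To make the idea concrete, here is a minimal hand-written sketch of one-hot encoding on the same data. The helper name `one_hot_rows` is hypothetical, and the code is only an illustration of the technique, not how `DictVectorizer` is implemented internally.

```python
# Hand-rolled one-hot encoding; illustrative only (hypothetical helper,
# not DictVectorizer's actual implementation).
data = [
    {'price': 85000, 'rooms': 5, 'neighborhood': 'South Bombay'},
    {'price': 65000, 'rooms': 3, 'neighborhood': 'Borivoli'},
    {'price': 75000, 'rooms': 4, 'neighborhood': 'Dadar'},
    {'price': 55000, 'rooms': 5, 'neighborhood': 'Thane'},
]


def one_hot_rows(records):
    """Return (feature_names, rows) with string values expanded to 0/1 columns."""
    names = set()
    for record in records:
        for key, value in record.items():
            # String values become 'key=value' indicator columns;
            # numeric values keep their key as the column name.
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1  # mark the category as present
            else:
                row[index[key]] = value           # copy the number through
        rows.append(row)
    return names, rows


feature_names, rows = one_hot_rows(data)
print(feature_names)
for row in rows:
    print(row)
```

Each record gets exactly one 1 among the four neighborhood columns, so no artificial ordering between neighborhoods is introduced.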

# Text Features

Another common need in feature engineering is converting text into numerical values that a model can use.

In [6]:
# sample text data 
sample = [
    'this is a lovely day',
    'Is this even a day yet ?',
    'I am already in love with this data.'
]
sample

['this is a lovely day',
 'Is this even a day yet ?',
 'I am already in love with this data.']

To vectorize this data, we split each sentence into words and give every distinct word its own column holding the number of times it occurs. This can be done with the `CountVectorizer` class in Scikit-learn.

In [7]:
# importing CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# creating a vectorizer object
vec = CountVectorizer()

# fitting vec to the sample and storing the word counts in X
X = vec.fit_transform(sample)
X

<3x12 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

The result is a sparse matrix holding the number of times each word appears in each sentence. It is easier to inspect after converting it to a pandas `DataFrame` with labeled columns.

In [8]:
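# importing pandas
import pandas as pd

# creating a data frame
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())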

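|   | already | am | data | day | even | in | is | love | lovely | this | with | yet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |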
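The counts can be reproduced by hand, which also shows why the one-letter words `a` and `I` have no column. The sketch below assumes `CountVectorizer`'s default behavior of lowercasing the text and keeping only tokens of two or more word characters. The helper name `count_words` is hypothetical.

```python
import re
from collections import Counter

sample = [
    'this is a lovely day',
    'Is this even a day yet ?',
    'I am already in love with this data.',
]

# Assumption: mirrors CountVectorizer's defaults (lowercase, tokens of 2+ word characters).
TOKEN = re.compile(r"\b\w\w+\b")


def count_words(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    counts = [Counter(TOKEN.findall(doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*counts))  # one column per distinct word
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = count_words(sample)
print(vocabulary)
for row in matrix:
    print(row)
```

This yields 12 columns and 16 nonzero entries, matching the shape of the sparse matrix reported for `X`.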


However, raw word counts have a drawback: they can put too much weight on words that occur very frequently, which becomes more of a problem as the amount of data grows.

To overcome this problem we can use *term frequency-inverse document frequency (TF-IDF)*, which weights each word count by a measure of how often the word appears across the documents, so that words found in many documents count for less. For example:

In [14]:
# importing the TfidfVectorizer class
from sklearn.feature_extraction.text import TfidfVectorizer

# creating a tfidf object
vec = TfidfVectorizer()

# fitting vec to the sample and storing the TF-IDF weights in X
X = vec.fit_transform(sample)

# creating a data frame
# (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())

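|   | already | am | data | day | even | in | is | love | lovely | this | with | yet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.480458 | 0.0 | 0.0 | 0.480458 | 0.0 | 0.631745 | 0.373119 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.406192 | 0.534093 | 0.0 | 0.406192 | 0.0 | 0.0 | 0.315444 | 0.0 | 0.534093 |
| 2 | 0.396875 | 0.396875 | 0.396875 | 0.0 | 0.0 | 0.396875 | 0.0 | 0.396875 | 0.0 | 0.2344 | 0.396875 | 0.0 |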
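To see where these weights come from, the following sketch recomputes them by hand. It assumes `TfidfVectorizer`'s default settings: a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the word, followed by scaling each row to unit Euclidean length. The helper name `tfidf` is hypothetical.

```python
import math
import re
from collections import Counter

sample = [
    'this is a lovely day',
    'Is this even a day yet ?',
    'I am already in love with this data.',
]

# Assumption: same tokenization as the vectorizer defaults (lowercase, 2+ word characters).
TOKEN = re.compile(r"\b\w\w+\b")


def tfidf(docs):
    """Return (vocabulary, rows) of L2-normalized TF-IDF weights."""
    counts = [Counter(TOKEN.findall(doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    n_docs = len(docs)

    # Document frequency: in how many documents each word occurs.
    df = {w: sum(1 for c in counts if w in c) for w in vocabulary}
    # Smoothed IDF (assumed default): words found in every document get weight 1.
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

    rows = []
    for c in counts:
        weights = [c[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in weights))  # Euclidean length of the row
        rows.append([x / norm for x in weights])
    return vocabulary, rows


vocabulary, rows = tfidf(sample)
for row in rows:
    print([round(x, 6) for x in row])
```

The word `this` occurs in all three sentences, so it receives the smallest weight in every row, while words unique to one sentence, such as `lovely`, receive the largest.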
