# Text Preparation/Feature Engineering

Feature engineering here means converting text data into numerical data. Why is this conversion required? Because machine learning models cannot work with raw text directly, so the text has to be turned into numbers first. This step is also called feature extraction from text.

There are several methods for converting textual or categorical data into numerical data.

### Encoding

Encoding is a technique for converting categorical variables into numerical values so that they can be fitted to a machine learning model.

Different ways of encoding:

1. Label Encoding
2. One-Hot Encoding
3. Ordinal Encoding
4. pandas `get_dummies()`
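
To make the four options concrete, here is a minimal sketch on a toy column. The `toy` DataFrame and its `size` values are made up for illustration and are not part of the SpaceX dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# Toy data (hypothetical, not taken from the SpaceX dataset)
toy = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# 1. LabelEncoder: integer codes in sorted (alphabetical) order of the categories
label_codes = LabelEncoder().fit_transform(toy["size"])

# 2. OneHotEncoder: one binary column per category (sparse by default, so densify)
one_hot = OneHotEncoder().fit_transform(toy[["size"]]).toarray()

# 3. OrdinalEncoder: integer codes that follow an explicitly given category order
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_codes = ordinal.fit_transform(toy[["size"]])

# 4. pandas get_dummies: one-hot encoding returned as a DataFrame
dummies = pd.get_dummies(toy, columns=["size"], dtype=int)

print(label_codes)
print(one_hot)
print(ordinal_codes.ravel())
print(dummies)
```

Label and ordinal encoding produce a single integer column, which implies an order among the categories. One-hot encoding and `get_dummies()` avoid that implied order at the cost of one column per category.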

![spaceX Data](https://www.teslarati.com/wp-content/uploads/2020/04/Falcon-Heavy-Demo-Feb-2018-SpaceX-1-crop-2048x956.jpg)

### Import all necessary Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
# READ CSV FILE WITH PANDAS
spacex = pd.read_csv('../Data/dataset_part_2.csv')
spacex.head()

|   | FlightNumber | Date | BoosterVersion | PayloadMass | Orbit | LaunchSite | Outcome | Flights | GridFins | Reused | Legs | LandingPad | Block | ReusedCount | Serial | Longitude | Latitude | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010-06-04 | Falcon 9 | 6123.547647 | LEO | CCSFS SLC 40 | None None | 1 | False | False | False |  | 1.0 | 0 | B0003 | -80.577366 | 28.561857 | 0 |
| 1 | 2 | 2012-05-22 | Falcon 9 | 525.0 | LEO | CCSFS SLC 40 | None None | 1 | False | False | False |  | 1.0 | 0 | B0005 | -80.577366 | 28.561857 | 0 |
| 2 | 3 | 2013-03-01 | Falcon 9 | 677.0 | ISS | CCSFS SLC 40 | None None | 1 | False | False | False |  | 1.0 | 0 | B0007 | -80.577366 | 28.561857 | 0 |
| 3 | 4 | 2013-09-29 | Falcon 9 | 500.0 | PO | VAFB SLC 4E | False Ocean | 1 | False | False | False |  | 1.0 | 0 | B1003 | -120.610829 | 34.632093 | 0 |
| 4 | 5 | 2013-12-03 | Falcon 9 | 3170.0 | GTO | CCSFS SLC 40 | None None | 1 | False | False | False |  | 1.0 | 0 | B1004 | -80.577366 | 28.561857 | 0 |


By now, you should have some preliminary insights into how each important variable affects the success rate. We now select the features that will be used for success prediction in a later module.

In [8]:
selected_features = spacex[['FlightNumber', 'PayloadMass', 'Orbit', 'LaunchSite', 'Flights', 'GridFins', 'Reused', 'Legs', 'Block', 'ReusedCount', 'Serial','Class']]
selected_features.head()

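|   | FlightNumber | PayloadMass | Orbit | LaunchSite | Flights | GridFins | Reused | Legs | Block | ReusedCount | Serial | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6123.547647 | LEO | CCSFS SLC 40 | 1 | False | False | False | 1.0 | 0 | B0003 | 0 |
| 1 | 2 | 525.0 | LEO | CCSFS SLC 40 | 1 | False | False | False | 1.0 | 0 | B0005 | 0 |
| 2 | 3 | 677.0 | ISS | CCSFS SLC 40 | 1 | False | False | False | 1.0 | 0 | B0007 | 0 |
| 3 | 4 | 500.0 | PO | VAFB SLC 4E | 1 | False | False | False | 1.0 | 0 | B1003 | 0 |
| 4 | 5 | 3170.0 | GTO | CCSFS SLC 40 | 1 | False | False | False | 1.0 | 0 | B1004 | 0 |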


From the data above, you can see that some columns are not numerical, so we have to convert them with the help of encoding.
The columns that need to be converted are `Orbit`, `LaunchSite`, and `Serial`.


For these features we can use `OneHotEncoder`, which requires the scikit-learn library.

In [10]:
# !pip install scikit-learn

## How to apply `get_dummies()`?

`get_dummies()` is a pandas function that helps us to apply encoding on categorical data.

In [33]:
dummies_df = pd.get_dummies(data=selected_features,drop_first=True)

The pandas `get_dummies` function is a straightforward, one-step way to get dummy variables for categorical features. Its advantage is that you can apply it directly to a DataFrame: by default it picks out the columns with object, string, or category dtype and encodes only those, leaving the other columns unchanged.

The result is also a DataFrame with the converted values, which you can inspect with `dummies_df.head()`.

In [34]:
dummies_df.head()

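|   | FlightNumber | PayloadMass | Flights | GridFins | Reused | Legs | Block | ReusedCount | Class | Orbit_GEO | ... | Serial_B1048 | Serial_B1049 | Serial_B1050 | Serial_B1051 | Serial_B1054 | Serial_B1056 | Serial_B1058 | Serial_B1059 | Serial_B1060 | Serial_B1062 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6123.547647 | 1 | False | False | False | 1.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 525.0 | 1 | False | False | False | 1.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 677.0 | 1 | False | False | False | 1.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 500.0 | 1 | False | False | False | 1.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 3170.0 | 1 | False | False | False | 1.0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |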
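
To see which columns `get_dummies` touches and what `drop_first=True` does, the following sketch may help. It uses a small made-up frame (`toy`) whose column names merely mimic the SpaceX data, and it passes `dtype=int` so that the dummy columns come out as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical three-row frame with one float, one bool, and one text column
toy = pd.DataFrame({
    "PayloadMass": [525.0, 677.0, 500.0],
    "GridFins": [False, True, False],
    "Orbit": ["LEO", "ISS", "PO"],
})

# Only the text column is encoded; drop_first removes the first category (ISS)
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The numeric and boolean columns pass through untouched, and a row with zeros in every `Orbit_*` column stands for the dropped category.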


In my view, pandas `get_dummies` is the first choice for a quick conversion. But if the number of categories is large, `OneHotEncoder` is a good choice because it supports sparse matrix output, and a fitted encoder can be reused on new data.
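
A rough illustration of the sparse-output point, assuming a made-up high-cardinality column (the `ids` array below is synthetic and only imitates booster serial numbers):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Synthetic column: 1,000 rows cycling through 500 distinct serial-like IDs
ids = np.array([f"B{i % 500:04d}" for i in range(1000)]).reshape(-1, 1)

# OneHotEncoder returns a SciPy sparse matrix by default
encoded = OneHotEncoder().fit_transform(ids)

print(sparse.issparse(encoded))  # whether the result is a sparse matrix
print(encoded.shape)             # rows x one column per distinct ID
print(encoded.nnz)               # number of stored (non-zero) entries
```

Each row stores a single non-zero entry, whereas a dense array of the same shape would hold every zero explicitly.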

## Get `OneHotEncoder`
After installation, we can import `OneHotEncoder` from sklearn.

The encoder should learn its categories from the training data only, so before applying `OneHotEncoder` we separate the features from the label and split the data with `train_test_split`.

## Get Output Column as y and Features as X

Here `X` contains the features and `y` contains the label.

In [58]:
X = selected_features.drop('Class',axis='columns')
y = selected_features[['Class']]

Now that the input and output data are separate, we can apply `train_test_split`.

Import it from sklearn:

In [61]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

**Perform `train_test_split` on data.**

In [85]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

Create an instance of `OneHotEncoder()`.

In [86]:
# `sparse` was renamed to `sparse_output` in scikit-learn 1.2
ohe = OneHotEncoder(drop='first', sparse_output=False)

##### Now Fit the data with this instance.

In [69]:
# encoded_X_train = ohe.fit_transform(X_train[['Orbit','LaunchSite','Serial']])
# encoded_X_test = ohe.transform(X_test[['Orbit','LaunchSite','Serial']])

`ValueError: Found unknown categories ['B1031', 'B1010', 'B1042', 'B1025', 'B0003'] in column 2 during transform`

If we directly run the above code then we will face the above error.

**Why this error?**

Because these categories were not present when we fitted the encoder on the training data. When the test data is transformed, the encoder meets categories it has never seen, and with the default setting `handle_unknown='error'` it raises an error.
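
The unknown-category error can be reproduced on a tiny example. The sketch below uses two made-up frames (`train` and `test`) and shows both the default behaviour and the `handle_unknown='ignore'` option of `OneHotEncoder`, which encodes an unseen category as a row of zeros instead of raising.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: B0003 appears only in the test rows
train = pd.DataFrame({"Serial": ["B1003", "B1004", "B1003"]})
test = pd.DataFrame({"Serial": ["B1004", "B0003"]})

# Default behaviour (handle_unknown='error'): transform raises on unseen categories
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError as err:
    raised = True
    print(err)

# handle_unknown='ignore': unseen categories become all-zero rows
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(train)
encoded_test = lenient.transform(test).toarray()
print(encoded_test)
```

The encoder is fitted once on the training rows, so the test matrix keeps exactly the columns the training matrix has.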

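Calling `fit_transform` on the test set would avoid the error, but it refits the encoder on the test data and yields columns that do not match the training matrix. Instead, we tell the encoder to ignore unknown categories, fit it on the training data only, and reuse it on the test data. An unknown category is then encoded as all zeros, the same as the dropped first category.

In [87]:
# Needs scikit-learn >= 1.2 (sparse_output); 'ignore' combined with drop needs >= 1.0
ohe = OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore')
encoded_X_train = ohe.fit_transform(X_train[['Orbit','LaunchSite','Serial']])
encoded_X_test = ohe.transform(X_test[['Orbit','LaunchSite','Serial']])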

In [91]:
encoded_X_train

array([[0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

As you can see, we now have numerical data for all the categorical features.

We no longer need the original categorical columns, so we drop them.

In [92]:
# Reassign instead of dropping in place; `columns=` already implies axis=1
X_train = X_train.drop(columns=['Orbit','LaunchSite','Serial'])
X_test = X_test.drop(columns=['Orbit','LaunchSite','Serial'])

In [94]:
X_train.head(2)

|   | FlightNumber | PayloadMass | Flights | GridFins | Reused | Legs | Block | ReusedCount |
|---|---|---|---|---|---|---|---|---|
| 44 | 45 | 4230.0 | 2 | True | True | True | 3.0 | 1 |
| 81 | 82 | 3880.0 | 1 | True | False | True | 5.0 | 12 |


This is `X_train` without the categorical columns.

As you can see, `GridFins`, `Reused`, and `Legs` are of bool type, so we convert them to float.

In [97]:
X_train_float = X_train.astype(dtype='float64')
X_test_float = X_test.astype(dtype='float64')

In [98]:
X_train_float.head(2)

|   | FlightNumber | PayloadMass | Flights | GridFins | Reused | Legs | Block | ReusedCount |
|---|---|---|---|---|---|---|---|---|
| 44 | 45.0 | 4230.0 | 2.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 |
| 81 | 82.0 | 3880.0 | 1.0 | 1.0 | 0.0 | 1.0 | 5.0 | 12.0 |


### Now we have `X_train_float`, `X_test_float`, `encoded_X_train`, and `encoded_X_test`.
#### Merge the converted data to make it ready for the model.

In [101]:
final_X_train = np.hstack((X_train_float,encoded_X_train))
final_X_test = np.hstack((X_test_float,encoded_X_test))

Now we have data ready to build the model.
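
As an alternative to encoding, dropping, and stacking by hand, scikit-learn's `ColumnTransformer` can do the same steps in one object. This is a swapped-in technique, not what the notebook above does, and the sketch assumes a made-up six-row table (`toy`) in place of the real dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Hypothetical miniature of the feature table
toy = pd.DataFrame({
    "PayloadMass": [525.0, 677.0, 500.0, 3170.0, 4230.0, 3880.0],
    "GridFins": [False, False, False, False, True, True],
    "Orbit": ["LEO", "ISS", "PO", "GTO", "LEO", "ISS"],
    "Class": [0, 0, 0, 0, 1, 1],
})

# Cast the bool column to float up front, as in the manual approach
X = toy.drop(columns="Class").astype({"GridFins": "float64"})
y = toy["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=42
)

# One-hot encode Orbit, pass the remaining numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["Orbit"])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

final_X_train = preprocess.fit_transform(X_train)  # fit on training rows only
final_X_test = preprocess.transform(X_test)        # reuse the fitted encoder

print(final_X_train.shape, final_X_test.shape)
```

Because the transformer is fitted once and then reused, the training and test matrices always end up with the same number of columns.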

### End