# Feature Engineering

1. _IMPUTATION_

The simplest way to handle missing values is to drop the affected rows or the entire column. There is no optimal
threshold for dropping, but you can use 70% as an example value and drop the rows and columns whose share of missing
values is higher than this threshold, as follows:

In [None]:
threshold = 0.7
#Dropping columns with missing value rate higher than threshold
data = data[data.columns[data.isnull().mean() < threshold]] 

#Dropping rows with missing value rate higher than threshold
data = data.loc[data.isnull().mean(axis=1) < threshold]
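To make the effect of the threshold concrete, here is a minimal sketch on a small made-up DataFrame. The name `toy` and its values are hypothetical and chosen only so that one column and one row exceed the 70% threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "c" is 75% missing, the last row is fully missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [4.0, np.nan, 6.0, np.nan],
    "c": [np.nan, np.nan, 9.0, np.nan],
})
threshold = 0.7

# Per-column missing rate is a=0.25, b=0.50, c=0.75, so only "c" is dropped
kept_cols = toy[toy.columns[toy.isnull().mean() < threshold]]

# The per-row missing rate is computed on the remaining columns;
# the last row is 100% missing and is dropped
kept_rows = kept_cols.loc[kept_cols.isnull().mean(axis=1) < threshold]

print(kept_rows)
```

Note that the order matters: dropping columns first changes the per-row missing rates that are evaluated afterwards.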

a) _NUMERICAL IMPUTATION_ : Numerical imputation depends on the nature of the feature. Missing values are generally replaced with 0, the median, or the mean of the column. The options below are alternatives, so apply only one of them:

In [None]:
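#Option 1: filling all missing values with 0
data = data.fillna(0)
#Option 2: filling missing values with medians of the numeric columns
data = data.fillna(data.median(numeric_only=True))
#Option 3: filling missing values with means of the numeric columns
data = data.fillna(data.mean(numeric_only=True))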
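As an illustration of why the choice depends on the nature of the feature, the sketch below compares mean and median imputation on a hypothetical `income` column that contains one extreme value. The data is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one extreme value and one missing value
toy = pd.DataFrame({"income": [30.0, 32.0, 35.0, 1000.0, np.nan]})

# The mean is pulled up by the extreme value
mean_filled = toy.fillna(toy.mean(numeric_only=True))

# The median stays close to the typical values
median_filled = toy.fillna(toy.median(numeric_only=True))

print(mean_filled["income"].iloc[-1])    # 274.25
print(median_filled["income"].iloc[-1])  # 33.5
```

For skewed columns like this one, the median is usually the safer default.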

b) _CATEGORICAL IMPUTATION_ : Replacing the missing values with the most frequent value in a column is a good option for handling categorical columns, for example:

In [None]:
#Filling missing values with the most frequent value of the column
data['column_name'] = data['column_name'].fillna(data['column_name'].value_counts().idxmax())
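The same idea on a small made-up column, as a minimal sketch. The `color` column and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
toy = pd.DataFrame({"color": ["red", "blue", "red", np.nan, "red", np.nan]})

# value_counts() ignores missing values, so idxmax() returns the most frequent label
most_frequent = toy["color"].value_counts().idxmax()

# Assign the result back instead of using inplace=True on a column selection
toy["color"] = toy["color"].fillna(most_frequent)

print(toy["color"].tolist())
```

If two labels are tied for the highest count, `idxmax()` returns only one of them, so it is worth checking the counts when the column is close to balanced.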

2. _HANDLING OUTLIERS_

Outliers can be detected in two ways:
    1) Standard deviation
    2) Percentiles

If the distance of a value from the mean is greater than x * standard deviation, it can be treated as an outlier. The choice of x is not trivial; it is usually taken between 2 and 4. With percentiles, values above an upper percentile or below a lower percentile (for example the 95th and the 5th) are treated as outliers.

Another option for handling outliers is to cap them instead of dropping them. This keeps your data size and, at the end of the day, might be better for the final model performance. On the other hand, capping can affect the distribution of the data, so it is better not to overdo it.


In [None]:
#Dropping the outlier rows with standard deviation
factor = 3
upper_lim = data['column'].mean() + data['column'].std() * factor
lower_lim = data['column'].mean() - data['column'].std() * factor

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]

#Dropping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)

data = data[(data['column'] < upper_lim) & (data['column'] > lower_lim)]

#Capping the outlier rows with Percentiles
upper_lim = data['column'].quantile(.95)
lower_lim = data['column'].quantile(.05)
data.loc[data['column'] > upper_lim, 'column'] = upper_lim
data.loc[data['column'] < lower_lim, 'column'] = lower_lim
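Percentile capping can also be written with `Series.clip`, which bounds values from below and above in one call. The sketch below uses a made-up column holding the integers 1 to 100; the names `toy` and `capped` are hypothetical.

```python
import pandas as pd

# Hypothetical column with the integers 1..100
toy = pd.DataFrame({"column": list(range(1, 101))})

# 5th and 95th percentiles (pandas interpolates linearly by default)
lower_lim = toy["column"].quantile(0.05)
upper_lim = toy["column"].quantile(0.95)

# Values outside the limits are replaced by the limits; no rows are dropped
toy["capped"] = toy["column"].clip(lower=lower_lim, upper=upper_lim)

print(len(toy), toy["capped"].min(), toy["capped"].max())
```

The row count is unchanged, which is the main difference from the dropping variants above.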


3. _BINNING_

The main motivation for binning is to make the model more robust and prevent overfitting; however, it has a cost in
performance. Every time you bin something, you sacrifice information and make your data more regularized.
The trade-off between performance and overfitting is the key point of the binning process. In my opinion,
for numerical columns, except for some obvious overfitting cases, binning might be redundant for some kinds of
algorithms because of its effect on model performance. For categorical columns, however, labels with low frequencies
probably harm the robustness of statistical models. Assigning a general category to these
less frequent values helps keep the model robust. For example, if your data has 100,000 rows,
it might be a good option to merge the labels with a count below 100 into a new category such as “Other”.

In [None]:
import numpy as np

#Example: grouping countries into continents
#Input:
#     Country
#0      Spain
#1      Chile
#2  Australia
#3      Italy
#4     Brazil
conditions = [
    data['Country'].str.contains('Spain'),
    data['Country'].str.contains('Italy'),
    data['Country'].str.contains('Chile'),
    data['Country'].str.contains('Brazil')]

choices = ['Europe', 'Europe', 'South America', 'South America']

data['Continent'] = np.select(conditions, choices, default='Other')
#Result:
#     Country      Continent
#0      Spain         Europe
#1      Chile  South America
#2  Australia          Other
#3      Italy         Europe
#4     Brazil  South America
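The binning text also suggests merging low-frequency labels into a general category. A minimal sketch of that idea follows, assuming a made-up `Country` column and a hypothetical cutoff `min_count`. On real data the cutoff would be much larger, such as the count of 100 mentioned for 100,000 rows.

```python
import pandas as pd

# Hypothetical data: "Chile" and "Peru" each appear only once
toy = pd.DataFrame({"Country": ["Spain"] * 5 + ["Italy"] * 4 + ["Chile"] + ["Peru"]})
min_count = 2  # hypothetical cutoff, chosen small for the toy data

# Labels whose frequency is below the cutoff
counts = toy["Country"].value_counts()
rare_labels = counts[counts < min_count].index

# Keep frequent labels as they are and replace rare ones with "Other"
toy["Country_binned"] = toy["Country"].where(~toy["Country"].isin(rare_labels), "Other")

print(toy["Country_binned"].value_counts())
```

The original column is kept next to the binned one so the mapping can be inspected before the raw labels are discarded.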

4. _TRANSFORMATION_

Example: log transformation

In [None]:
import numpy as np

#Log transformation; adding 1 avoids log(0) for zero values
data['log+1'] = (data['value']+1).transform(np.log)

#Negative values handling
#Note that the values are different
data['log'] = (data['value']-data['value'].min()+1).transform(np.log)
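The two variants can be checked on a small made-up column, as sketched below. The `value` data is hypothetical. `np.log1p` is NumPy's built-in for `log(1 + x)` and gives the same result as the first variant for non-negative values.

```python
import numpy as np
import pandas as pd

# Hypothetical column that contains a negative value
toy = pd.DataFrame({"value": [-3.0, 0.0, 9.0, 99.0]})

# Shift the column so its minimum becomes 1, then take the log;
# the minimum maps to log(1) = 0 and the ordering is preserved
shifted = toy["value"] - toy["value"].min() + 1
toy["log"] = np.log(shifted)

# For non-negative data, np.log1p(x) equals np.log(x + 1)
nonneg = pd.Series([0.0, 9.0, 99.0])
log1p = np.log1p(nonneg)

print(toy["log"].tolist())
print(log1p.tolist())
```

Because the shift depends on the column minimum, the same minimum has to be reused when transforming new data.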

5. _ONE-HOT ENCODING_

This will be covered in the data preparation part.

6. _SCALING_

a. Standardisation

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler

data_scaled = StandardScaler()
data_scaled.fit(data['column'].values.reshape(-1,1)) # finding the mean and standard deviation of this data
print(f"Mean : {data_scaled.mean_[0]}, Standard deviation : {np.sqrt(data_scaled.var_[0])}")

# Now standardize the data with the above mean and variance.
column_values = data_scaled.transform(data['column'].values.reshape(-1, 1))

b. Normalization

In [None]:
from sklearn import preprocessing

# Normalize total_bedrooms column (by default, normalize() rescales the vector to unit L2 norm)
x_data = np.array(data['total_bedrooms'])
normalized_data = preprocessing.normalize([x_data])
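To show what the two scalers actually produce, here is a minimal sketch on a made-up array. The `values` array is hypothetical. It assumes scikit-learn's defaults: `StandardScaler` uses the population standard deviation, and `normalize` uses the L2 norm.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values
values = np.array([1.0, 2.0, 3.0, 4.0])

# Standardisation: subtract the mean, divide by the standard deviation
standardized = StandardScaler().fit_transform(values.reshape(-1, 1)).ravel()

# normalize() treats the list as one sample and rescales it to unit L2 norm
normalized = preprocessing.normalize([values])[0]

print(standardized.mean(), standardized.std())
print(np.linalg.norm(normalized))
```

Note that this kind of normalization is not a min-max rescaling to the range [0, 1]. If that is what is needed, scikit-learn's `MinMaxScaler` is the usual tool.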

7. _EXTRACTING DATE_

In [None]:
from datetime import date
import pandas as pd

#Transform string to date
data['date'] = pd.to_datetime(data['date'], format="%d-%m-%Y")

#Extracting Year
data['year'] = data['date'].dt.year

#Extracting Month
data['month'] = data['date'].dt.month

#Extracting passed years since the date
data['passed_years'] = date.today().year - data['date'].dt.year

#Extracting passed months since the date
data['passed_months'] = (date.today().year - data['date'].dt.year) * 12 + date.today().month - data['date'].dt.month

#Extracting the weekday name of the date
data['day_name'] = data['date'].dt.day_name()
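The date features can be tried end to end on a few made-up dates. This sketch replaces `date.today()` with a fixed, hypothetical `reference` date so that the results are reproducible. The sample dates are also invented for the example.

```python
from datetime import date

import pandas as pd

# Hypothetical dates in day-month-year format
toy = pd.DataFrame({"date": ["01-01-2017", "04-12-2008", "23-06-1988"]})
toy["date"] = pd.to_datetime(toy["date"], format="%d-%m-%Y")

# Fixed stand-in for date.today(), an assumption made for reproducibility
reference = date(2020, 1, 15)

toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
# Difference in calendar years, ignoring month and day
toy["passed_years"] = reference.year - toy["date"].dt.year
# Difference in calendar months, ignoring the day of the month
toy["passed_months"] = (reference.year - toy["date"].dt.year) * 12 + reference.month - toy["date"].dt.month
toy["day_name"] = toy["date"].dt.day_name()

print(toy)
```

Both differences count calendar boundaries crossed rather than full elapsed periods, so a date from last December counts as one passed year in January.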