1. Using the documentation for Recursive Feature Selection, apply this process to the
crime dataset to create the best multivariate linear regression model
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html .
You can select what you’re trying to predict, but be sure to indicate what that is. Be sure
to explain what RFE is in the markdown. You should be able to answer this using what’s
on the documentation page + what you already know.


RFE - Recursive Feature Elimination

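RFE selects features by recursive elimination. It fits the estimator on the current set of features, ranks the features by the importance the fitted estimator assigns to them (its `coef_` or `feature_importances_` attribute), and removes the least important one(s). It then repeats the process on the smaller feature set until the requested number of features remains.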
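
To make the elimination loop concrete, here is a minimal sketch of the idea using only NumPy. It is not scikit-learn's implementation: the helper name `manual_rfe` and the synthetic data are assumptions made for illustration, and importance is taken to be the absolute value of an ordinary least squares coefficient.

```python
import numpy as np

def manual_rfe(X, y, n_features_to_select=1):
    """Rank features (1 = kept) by repeatedly dropping the weakest one."""
    n_features = X.shape[1]
    remaining = list(range(n_features))
    ranking = np.ones(n_features, dtype=int)
    while len(remaining) > n_features_to_select:
        # Fit least squares with an intercept on the remaining columns.
        A = np.column_stack([X[:, remaining], np.ones(len(y))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        # The least important feature has the smallest absolute coefficient
        # (the intercept, stored last, is excluded).
        weakest = remaining[int(np.argmin(np.abs(coef[:-1])))]
        # Features dropped earlier receive a larger (worse) rank.
        ranking[weakest] = len(remaining) - n_features_to_select + 1
        remaining.remove(weakest)
    return ranking

# Synthetic data (assumption): the target depends strongly on column 0,
# less on columns 1 and 2, and not at all on column 3.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 4))
y_toy = (5.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + 0.5 * X_toy[:, 2]
         + rng.normal(scale=0.1, size=50))

toy_ranking = manual_rfe(X_toy, y_toy, n_features_to_select=1)
print("Ranking:", toy_ranking)
```

The columns here share a common scale, so comparing coefficient magnitudes is reasonable. With features on very different scales, the coefficients would need standardized inputs to be comparable.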

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import RFE
from sklearn.ensemble import AdaBoostRegressor


crime_df = pd.read_csv("crime_data.csv")
crime_df.head()

    X1   X2  X3  X4  X5  X6  X7
0  478  184  40  74  11  31  20
1  494  213  32  72  11  43  18
2  643  347  57  70  18  16  16
3  341  565  31  71  11  25  19
4  773  327  67  72   9  29  24


In [3]:
crime_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   X1      50 non-null     int64
 1   X2      50 non-null     int64
 2   X3      50 non-null     int64
 3   X4      50 non-null     int64
 4   X5      50 non-null     int64
 5   X6      50 non-null     int64
 6   X7      50 non-null     int64
dtypes: int64(7)
memory usage: 2.9 KB


In [4]:
# Use AdaBoostRegressor as the estimator inside RFE and keep only the single top-ranked feature.
# Target to predict: X1. Candidate predictors: X2 through X7.

X = crime_df.drop('X1', axis=1)
y = crime_df['X1']

estimator = AdaBoostRegressor(random_state=0, n_estimators=100)
selector = RFE(estimator, n_features_to_select=1, step=1)
selector = selector.fit(X, y)
 
mask = selector.support_  # named mask to avoid shadowing the built-in filter()
ranking = selector.ranking_

print("Mask data: ", mask)
print("Ranking: ", ranking)


Mask data:  [ True False False False False False]
Ranking:  [1 2 3 5 6 4]


In [5]:
# Show which features were selected
features = np.array(X.columns)
print("All features:")
print(features)

print("Selected features:")
print(features[mask])

All features:
['X2' 'X3' 'X4' 'X5' 'X6' 'X7']
Selected features:
['X2']
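
The question asks for a linear regression model, so one way to extend this is to run RFE with `LinearRegression` as the estimator and compare cross-validated R² for each number of kept features. The sketch below is only an illustration: it uses synthetic stand-in data (an assumption, since `crime_data.csv` is not reproduced here) with the same column names, and `demo_df`, `X_demo`, `y_demo`, and `results` are hypothetical names. Scaling and RFE sit inside a pipeline so that each cross-validation fold selects features from its own training data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 50 rows, columns named like the crime dataset.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(rng.normal(size=(50, 6)),
                       columns=["X2", "X3", "X4", "X5", "X6", "X7"])
demo_df["X1"] = (3.0 * demo_df["X2"] - 2.0 * demo_df["X4"]
                 + rng.normal(scale=0.5, size=50))

X_demo = demo_df.drop("X1", axis=1)
y_demo = demo_df["X1"]

results = {}
for k in range(1, X_demo.shape[1] + 1):
    # Standardize so coefficient magnitudes are comparable, select k features
    # with RFE, then fit the final linear model on the selected features.
    pipe = make_pipeline(
        StandardScaler(),
        RFE(LinearRegression(), n_features_to_select=k, step=1),
        LinearRegression(),
    )
    score = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="r2").mean()

    # Refit on all rows only to report which columns RFE keeps for this k.
    pipe.fit(X_demo, y_demo)
    selected = list(X_demo.columns[pipe.named_steps["rfe"].support_])
    results[k] = (selected, score)
    print(k, selected, round(score, 3))
```

On the real data, the same loop could be run with `X` and `y` from the cells above, and the value of `k` with the best cross-validated score would suggest how many features to keep.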


2. Create a list of preprocessing steps you should try when working to build a model. Briefly describe what each step is. Work with your group to come up with the most comprehensive list you can.

- Plot the overall distribution of your data
    - This will help us know if our data are very skewed
    - This will also help us know what further preprocessing steps we need
- Look at what data types you have, how many features, how many rows, etc.
    - How many categorical vs. numeric?
    - Any datetime values? etc.
- Cleaning nulls 
    - Removing features that have too many nulls (e.g. over 50%)
- Handle missing data
    - Might want to replace nulls with 0, the mean, the median, etc., depending on our data
    - Need to be careful about how this affects our data
- Standardize your data
    - Rescale each feature to have mean 0 and standard deviation 1
- Scale (normalize) your data
    - Transform features to the same scale, e.g. a 0 to 1 range (especially useful for distance-based measures, which are very sensitive to scale)
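- Convert categorical to numeric
    - This is crucial because most machine learning algorithms only accept numeric input
- One-hot encoding
    - Create one 0/1 column per category so the model does not read an order into the category labels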
- Cleaning up your data in general
    - Missing values
    - Changing strings to numeric or proper dates
    - Spaces in odd places
    - Weird characters, etc.
- Correlation matrix
    - See where you have multicollinearity
    - If applicable, drop features that are highly correlated with each other before running your model
    - Drop redundant columns
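
Several of the steps above can be sketched with pandas and scikit-learn. This is only an illustration on a hypothetical toy DataFrame: the column names, the 50% null cutoff, and the 0.95 correlation cutoff are assumptions, and the right choices depend on the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data showing the problems listed above.
raw = pd.DataFrame({
    "income": [52000.0, np.nan, 61000.0, 48000.0, 75000.0, 58000.0],
    "income_k": [52.0, 55.0, 61.0, 48.0, 75.0, 58.0],  # redundant with income
    "age": [34, 45, np.nan, 52, 29, 41],
    "city": [" Austin", "Boston ", "austin", "Boston", np.nan, "Austin"],
    "mostly_null": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
})

# Look at data types and the share of nulls per column.
print(raw.dtypes)
print(raw.isna().mean())

# Remove features with too many nulls (here: more than 50%).
null_share = raw.isna().mean()
df = raw.drop(columns=null_share[null_share > 0.5].index)

# General cleanup: stray spaces and inconsistent capitalization.
df["city"] = df["city"].str.strip().str.title()

# Correlation matrix: drop one column from each highly correlated pair.
corr = df.select_dtypes("number").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=redundant)

# Handle remaining missing data: median for numeric, mode for categorical.
for col in ["income", "age"]:
    df[col] = df[col].fillna(df[col].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["city"], dtype=int)

# Standardize numeric features to mean 0 and standard deviation 1.
df[["income", "age"]] = StandardScaler().fit_transform(df[["income", "age"]])

print(df)
```

When the model will be evaluated on held-out data, the imputation values and scaler would normally be fitted on the training split only and then applied to the test split.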