## Project 2 - Regression Challenge

### Predict the sale price of homes in the Ames, Iowa housing dataset

#### Data Exploration and Manipulation


In [1]:
# # Installations 
# !pip install numpy
# !pip install pandas
# !pip install matplotlib
# !pip install seaborn
# !pip install scikit-learn

In [None]:
import numpy as np
import pandas as pd

from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV, LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%config InlineBackend.figure_format = 'retina'
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
df_original = pd.read_csv('./Data/train (2).csv')
df_ktest_original = pd.read_csv('./Data/test (2).csv')


In [None]:
df = df_original.drop(['Id', 'PID'], axis=1)
df_ktest = df_ktest_original.drop(['Id', 'PID'], axis=1)

In [None]:
df_ktest.shape

In [None]:
df.head()

In [None]:
df_ktest.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum().sort_values(ascending=False)

In [None]:
# replace spaces in column names and convert all columns to lowercase:
df.columns = [x.lower().replace(' ','_') for x in df.columns]
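
As a quick illustration of the renaming step, here is a minimal sketch on a made-up two-column frame. The column names and values are hypothetical examples, not rows from the actual data.

```python
import pandas as pd

# Hypothetical frame whose headers contain spaces and capital letters
demo = pd.DataFrame({'Lot Frontage': [65.0, 80.0], 'Garage Yr Blt': [1976.0, 2001.0]})

# Same transformation as in the cell above: lowercase, spaces replaced by underscores
demo.columns = [x.lower().replace(' ', '_') for x in demo.columns]

print(list(demo.columns))
```

Normalized names like these can be typed without quoting issues and stay consistent between the training and test frames, provided the same step is applied to both.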

In [None]:
df.head()

In [None]:
df_object = df.select_dtypes(include=['object'])

In [None]:
df_numbers = df.select_dtypes(exclude=['object'])

In [None]:
df_object.head()

In [None]:
df_numbers.head()

In [None]:
# Count the missing values in each numeric column, sort in descending order,
# and show the 15 columns with the most missing values.
df_numbers.isnull().sum().sort_values(ascending=False)[:15]

In [None]:

# Replace missing values in these numeric columns with 0.0.
# Assigning the result back avoids the chained-assignment problems of
# calling fillna(inplace=True) on a single column of a derived frame.
zero_fill_cols = ['lot_frontage', 'garage_yr_blt', 'mas_vnr_area', 'bsmt_half_bath',
                  'bsmt_full_bath', 'garage_cars', 'bsmtfin_sf_1', 'bsmtfin_sf_2',
                  'bsmt_unf_sf', 'total_bsmt_sf', 'garage_area']
df_numbers = df_numbers.copy()
df_numbers[zero_fill_cols] = df_numbers[zero_fill_cols].fillna(0.0)
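
One way to apply a zero-fill without relying on `inplace=True` on a single column is to fill a list of columns at once and assign the result back. The sketch below uses made-up values, and the names `demo_numbers` and `fill_cols` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame with gaps in two columns
demo_numbers = pd.DataFrame({
    'garage_area': [480.0, np.nan, 520.0],
    'lot_frontage': [np.nan, 70.0, 65.0],
    'saleprice': [130500, 220000, 109000],
})

# Fill only the chosen columns and assign the result back to the frame
fill_cols = ['garage_area', 'lot_frontage']
demo_numbers[fill_cols] = demo_numbers[fill_cols].fillna(0.0)

# Total number of missing values left in the frame
remaining_nulls = int(demo_numbers.isnull().sum().sum())
print(remaining_nulls)
```

Whether 0.0 is a sensible stand-in depends on the column. It reads naturally for an area or a count, and less so for a year.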

In [None]:
df_numbers.isnull().sum().sort_values(ascending=False)[:15]

In [None]:
df_numbers.head()

In [None]:
# Use every numeric column except the target (saleprice) as a candidate feature
features_list = [each for each in df_numbers.columns if each != 'saleprice']

In [None]:
X = df_numbers[features_list]
y = df_numbers['saleprice']

In [None]:
X.head()

In [None]:
X.shape

In [None]:
y.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
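
Because no `test_size` is passed, `train_test_split` falls back to its default and holds out 25% of the rows. A small sketch on made-up arrays (`X_demo` and `y_demo` are hypothetical) shows the resulting shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features
X_demo = np.arange(300).reshape(100, 3)
y_demo = np.arange(100)

# No test_size given, so the default 25% hold-out applies
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, random_state=42)

print(Xa.shape, Xb.shape)
```

Fixing `random_state` makes the split reproducible from run to run.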

In [None]:
X_test.shape

In [None]:
y_test.shape

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
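# Rank the numeric features by their correlation with saleprice, using the training split only.
# After the descending sort, position 0 is saleprice itself (correlation 1.0), so the slice starts at 1.
df_numbers_corr = list(pd.concat([X_train, y_train],
                          axis=1).corr()['saleprice'].sort_values(ascending=False).index[1:38])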
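
The ranking above sorts by signed correlation, so a feature with a strong negative correlation ends up at the bottom of the list. The sketch below, on synthetic data with hypothetical column names, contrasts that with sorting by absolute correlation. It is an illustration of the difference, not a change to the notebook's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)
negative = rng.normal(size=n)
noise = rng.normal(size=n)

# Hypothetical target: rises with `strong`, falls with `negative`, ignores `noise`
target = 3.0 * strong - 2.0 * negative + rng.normal(scale=0.1, size=n)
demo = pd.DataFrame({'strong': strong, 'negative': negative,
                     'noise': noise, 'saleprice': target})

corr = demo.corr()['saleprice'].drop('saleprice')

# Signed sort: the negatively correlated column lands last
signed_rank = list(corr.sort_values(ascending=False).index)
# Sort by magnitude: strong negative correlations rank near the top
abs_rank = list(corr.abs().sort_values(ascending=False).index)

print(signed_rank)
print(abs_rank)
```

If the slice ever keeps fewer columns than are available, the magnitude-based ordering avoids discarding informative negative predictors.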

In [None]:

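# Same ranking, computed on the hold-out split.
# NOTE: the assignment below rebinds df_ktest to hold-out rows taken from X_test,
# so from here on df_ktest no longer refers to the Kaggle test data loaded from 'test (2).csv'.
df_ktest_numbers_corr = list(pd.concat([X_test, y_test],
                          axis=1).corr()['saleprice'].sort_values(ascending=False).index[1:38])

df_ktest = X_test[df_ktest_numbers_corr]
df_ktest.shape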

In [None]:
sns.heatmap(X_train[df_numbers_corr].corr(), vmin=-1, vmax=1);

In [None]:
X_train = X_train[df_numbers_corr]
X_test = X_test[df_numbers_corr]

In [None]:
X_test.shape

In [None]:
X_train.columns

In [None]:
X_test.columns

#### Data Preprocessing 

**Preprocessing** is the initial phase of a machine learning workflow, in which raw data is transformed into a clean, structured format suitable for training and evaluating models. It is necessary because real-world data is often incomplete, inconsistent, noisy, or in a format that learning algorithms cannot use directly.


**PolynomialFeatures()** is a transformer in scikit-learn's `preprocessing` module. It generates polynomial and interaction features from the existing features in a dataset (by default up to degree 2, plus a constant bias column). Its primary purpose is to let linear models such as Linear Regression capture non-linear relationships in the data.


**StandardScaler()** is a scikit-learn preprocessing transformer that standardizes features by removing the mean and scaling to unit variance, so that each feature has a mean of 0 and a standard deviation of 1 on the data used to fit the scaler. Many machine learning algorithms are sensitive to the scale of features, especially those based on distance calculations (e.g., K-Nearest Neighbors, Support Vector Machines) or gradient descent (e.g., neural networks). Regularized linear models such as Lasso are affected as well, because the penalty acts on the size of the coefficients. StandardScaler puts all features on a comparable scale, which prevents features with larger numerical ranges from dominating those with smaller ranges.
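
A minimal sketch of the scaling step, using made-up numbers (`train` and `test` here are hypothetical arrays, not the notebook's splits). The scaler learns its mean and standard deviation from the training rows and reuses them on new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales
train = np.array([[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]])
test = np.array([[2000.0, 4.0]])

ss = StandardScaler()
train_scaled = ss.fit_transform(train)  # learns mean and std from the training rows
test_scaled = ss.transform(test)        # reuses the training statistics

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

Fitting on the training rows only keeps information from the hold-out rows out of the preprocessing.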

**LogisticRegression()** is a scikit-learn estimator for supervised classification. It is primarily used for binary classification, although extensions exist for multi-class problems (multinomial and ordinal logistic regression). It models the probability of a binary outcome (e.g., spam/not spam, disease/no disease) based on a set of input features. Because sale price is a continuous target, this notebook imports the class but does not fit it.
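
For contrast with the regression models used in this notebook, the sketch below fits a logistic regression to a made-up binary label. The data and the names `X_demo`, `y_demo`, and `clf` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical binary problem: the label is 1 when the single feature is large
X_demo = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_demo, y_demo)

# Each row holds [P(class 0), P(class 1)] for one input
proba = clf.predict_proba(np.array([[0.5], [4.5]]))
print(proba)
```

The output is a probability per class rather than a continuous value, which is why this estimator does not fit a price-prediction task.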


A machine learning (ML) **pipeline** chains the steps of a workflow, such as preprocessing transformers followed by a final estimator, into a single object, so that the same transformations are applied in the same order whenever the model is fit or used to predict.
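
A minimal sketch of a two-step pipeline on made-up data (the step names `ss` and `lr` and the arrays are hypothetical). Calling `fit` runs the scaler and then the regressor, and `predict` applies the fitted scaler before predicting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data that follows y = 2x + 1 exactly
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([3.0, 5.0, 7.0, 9.0])

demo_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lr', LinearRegression()),
])
demo_pipe.fit(X_demo, y_demo)

# The new row is scaled with the training statistics, then passed to the regressor
pred = demo_pipe.predict(np.array([[5.0]]))
print(pred)
```

Individual fitted steps remain accessible by name through `demo_pipe.named_steps`.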

**Lasso** (Least Absolute Shrinkage and Selection Operator) is a regularization technique in machine learning, particularly useful for linear regression models. It performs both feature selection and regularization, helping to prevent overfitting and improve model interpretability. Lasso achieves this by adding an L1 penalty (the sum of the absolute values of the coefficients) to the model's loss function, which shrinks coefficients toward zero and can set some of them exactly to zero, effectively removing those features from the model.
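
The selection effect can be seen on synthetic data. In the sketch below the target depends on the first feature only, and the value `alpha=0.5` is an arbitrary choice for illustration (all names and data are hypothetical).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
x_signal = rng.normal(size=n)   # drives the target
x_noise = rng.normal(size=n)    # unrelated to the target
X_demo = np.column_stack([x_signal, x_noise])
y_demo = 3.0 * x_signal + rng.normal(scale=0.1, size=n)

# scikit-learn's Lasso minimizes (1 / (2 * n)) * ||y - Xw||^2 + alpha * ||w||_1
lasso = Lasso(alpha=0.5)
lasso.fit(X_demo, y_demo)

# The informative coefficient is shrunk below 3; the irrelevant one is dropped to zero
print(lasso.coef_)
```

`LassoCV`, used later in this notebook, picks `alpha` by cross-validation instead of fixing it by hand.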

In [None]:
X_train.shape

In [None]:
y_test.shape

In [None]:
X_train.columns

In [None]:
# LassoCV model using only the numeric variables, with nulls replaced by 0 in each variable
pipe = Pipeline([
    ('pf', PolynomialFeatures()),
    ('ss', StandardScaler()),
    ('lcv', LassoCV(n_alphas=500, max_iter=1000))
])

pipe.fit(X_train, y_train)

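# For a regressor, score() returns the coefficient of determination (R^2)
print('train R^2:', pipe.score(X_train, y_train))
print('test R^2: ', pipe.score(X_test, y_test))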
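
The two numbers printed above are R² values, since `score` on a scikit-learn regressor (or a pipeline ending in one) reports the coefficient of determination. The sketch below, on synthetic data with hypothetical names, checks that equivalence and also computes RMSE, which is expressed in the units of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 2))
y_demo = X_demo @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=50)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

r2_from_score = model.score(X_demo, y_demo)    # what .score() reports
r2_from_metric = r2_score(y_demo, y_hat)       # same quantity, computed explicitly
rmse = float(np.sqrt(mean_squared_error(y_demo, y_hat)))

print(r2_from_score, r2_from_metric, rmse)
```

A large gap between the training and hold-out R² is a common sign of overfitting.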


In [None]:
pipe.named_steps['lcv'].alpha_

In [None]:
X_test.head()

In [None]:
# replace spaces in column names and convert all columns to lowercase:
df_ktest.columns = [x.lower().replace(' ','_') for x in df_ktest.columns]

In [None]:
df_ktest.head()

In [None]:
df_ktest_numbers = df_ktest.select_dtypes(exclude=['object'])

In [None]:
df_ktest_numbers.shape

In [None]:
df_ktest_numbers.isnull().sum().sort_values(ascending=False)[:15]

In [None]:
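# Replace missing values with 0.0, assigning back rather than using inplace=True on a column
ktest_fill_cols = ['lot_frontage', 'garage_yr_blt', 'mas_vnr_area']
df_ktest_numbers = df_ktest_numbers.copy()
df_ktest_numbers[ktest_fill_cols] = df_ktest_numbers[ktest_fill_cols].fillna(0.0)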


In [None]:
df_ktest_numbers.isnull().sum().sort_values(ascending=False)[:15]

In [None]:
X_train.shape

In [None]:
df_ktest_numbers.shape

In [None]:
# 'Id' and 'PID' were already dropped from both frames right after loading

In [None]:
X_train[sorted(X_train.columns)].head()

In [None]:
df_ktest_numbers[sorted(df_ktest_numbers.columns)].head()

In [None]:
# ^ is Python's symmetric-difference operator for sets.
# It returns a new set with the elements that are in exactly one of the two sets.
# Here it reveals any column names that appear in one frame but not the other.

set(df_ktest_numbers.columns) ^ set(X_train.columns)
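
A small sketch of the same check on made-up column names (the sets below are hypothetical). The plain set difference is added to show which side each mismatch comes from.

```python
# Hypothetical column sets for two frames
train_cols = {'lot_area', 'gr_liv_area', 'garage_cars'}
test_cols = {'lot_area', 'gr_liv_area', 'pool_area'}

# Names that appear in exactly one of the two sets
only_in_one = train_cols ^ test_cols

# Plain differences say which frame each mismatched name belongs to
only_in_train = train_cols - test_cols
only_in_test = test_cols - train_cols

print(only_in_one, only_in_train, only_in_test)
```

An empty result from `^` means the two frames share exactly the same column names.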

In [None]:
X_train.columns

In [None]:
df_ktest_numbers.columns

In [None]:
df_ktest_numbers = df_ktest_numbers[pipe.feature_names_in_]

In [None]:
preds = pipe.predict(df_ktest_numbers)

In [None]:
# Take as many Ids as there are predictions rather than hard-coding the row count
ids = df_ktest_original['Id'][:len(preds)]

preds_df = pd.DataFrame({
    'Id': ids,
    'saleprice': preds
})
preds_df.head(10)

In [None]:
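import datetime

# Timestamp without spaces or colons so it is safe to use in a file name
now = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

# f-string so the timestamp is actually substituted into the file name
out_path = f'kaggle_Preds_{now}.csv'
preds_df.to_csv(out_path, index=False)

In [None]:
# pd.read_csv(out_path)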
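
A file name only picks up the timestamp when the string is an f-string; a plain string keeps the braces literally. The sketch below writes a made-up predictions frame to a temporary directory and reads it back. The frame, the `stamp` format, and the temporary location are all assumptions for illustration.

```python
import datetime
import os
import tempfile

import pandas as pd

# Hypothetical predictions frame
demo_preds = pd.DataFrame({'Id': [1, 2, 3], 'saleprice': [150000.0, 210500.0, 98000.0]})

# Compact timestamp with no spaces or colons
stamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

out_dir = tempfile.mkdtemp()   # throwaway directory for the demo
out_file = os.path.join(out_dir, f'kaggle_Preds_{stamp}.csv')

demo_preds.to_csv(out_file, index=False)

# Read the file back to confirm what was written
round_trip = pd.read_csv(out_file)
print(os.path.basename(out_file))
print(round_trip)
```

Reading the file back is a cheap check that the columns and row count match what the submission expects.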