# Using Machine Learning Tools: Workshop 2

**Chapter 1 – The Machine Learning landscape**

This is a modified version of the code accompanying Chapter 1 of 
_Hands-on Machine Learning with Scikit-Learn, Keras and TensorFlow, 2e_ by Aurelien Geron

--------------------------------------------------------------------

First check we have the required Python libraries.

Although Python 2.x may work, it is deprecated so we strongly recommend you use Python 3 instead.

In [1]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

In [2]:
# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"
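
A caveat on the check above: version strings are compared character by character, so the comparison can give the wrong answer once a component has more than one digit. A minimal sketch of a numeric comparison follows. `version_tuple` is a hypothetical helper written for this illustration and is not part of Scikit-Learn.

```python
def version_tuple(version, parts=2):
    """Return the leading numeric components of a version string as a tuple of ints."""
    numbers = []
    for piece in version.split(".")[:parts]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as "rc1" or "dev0"
            digits += ch
        numbers.append(int(digits) if digits else 0)
    return tuple(numbers)

# String comparison: "9" sorts after "2", so 0.9 looks newer than 0.20
print("0.9" >= "0.20")
# Numeric comparison: (0, 9) is older than (0, 20)
print(version_tuple("0.9") >= (0, 20))
```

With this helper the check could be written as `assert version_tuple(sklearn.__version__) >= (0, 20)`.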

In [3]:
# To plot pretty figures directly within Jupyter
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

In [4]:
# Core numerical and modelling libraries
import numpy as np
import pandas as pd
import sklearn.linear_model

The Markdown cells below correspond to the questions on the "Workshop 2: Dealing with data" page at my university.

## Question 1. Read the CSV file into a Pandas DataFrame


In [5]:
# Load the data using a pandas function
housing = pd.read_csv("workshop2.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'workshop2.csv'

## Question 2. Use the DataFrame functions such as head(), info(), describe() to get a quick overview of the data. Pay attention to the type, count and range of each feature.

In [None]:
housing.head()

In [None]:
housing.info()

In [None]:
housing.describe()

## Question 3. Handling Missing/Invalid Data

In [None]:
housing.drop(columns=['ocean_proximity'], inplace=True)  # drop the non-numeric feature
print(housing.head())  # first few rows
print(housing.iloc[:, :])  # iloc[:, :] selects everything; narrow the slices (e.g. iloc[:10, :3]) to print only a portion

In [None]:
print(housing.isna().sum())  # number of missing values in each column

In [None]:
# The following is what I use (and I put it in the PythonCookbook)
# Notice the difference between this output and the previous one: entries that cannot be
# parsed as numbers are coerced to NaN here, so they are counted as well
housing.apply(pd.to_numeric, errors="coerce").isna().sum()
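
The difference between the two counts can be seen on a small example. This is a sketch on hypothetical toy data (the `toy` frame and its values are invented for illustration), where `"??"` stands in for an invalid entry and `None` for a genuinely missing one.

```python
import pandas as pd

# Hypothetical toy data: one missing value (None) and one invalid value ("??")
toy = pd.DataFrame({"rooms": ["5", "??", None, "7"],
                    "age": [10, 20, 30, 40]})

# isna() only sees the genuinely missing entry
missing_raw = toy.isna().sum()

# Coercing to numeric turns the invalid "??" into NaN as well
missing_coerced = toy.apply(pd.to_numeric, errors="coerce").isna().sum()

print(missing_raw)
print(missing_coerced)
```

The `rooms` column shows one missing value before coercion and two afterwards, because `"??"` is a valid string but not a valid number.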

In [None]:
# Convert to numerical, then drop all problem rows (this is only one option for dealing with these)
housing = housing.apply(pd.to_numeric, errors="coerce")
housing.info()
print(housing.isna().sum())  # missing values per column after conversion
housing.dropna(inplace=True)  # Comment this out if you want to use imputation later on
housing.info()

## Question 4. Visualization with the help of plots
**One observation: if the data still contains NaN values (i.e. if you have commented out `housing.dropna(inplace=True)` in Question 3) and you try to plot, the box plots for total_rooms and total_bedrooms will be empty, because these features contain NaN values.
These are handled in the imputation step later. If you leave the `dropna` call active and run it, the box plots will not be empty, since the rows with NaNs are removed.**

In [None]:
# Visualise using boxplots
for n in range(housing.shape[1]):
    plt.boxplot(housing.iloc[:,n],vert=False)
    plt.title(housing.columns[n])
    plt.show()

In [None]:
# Visualise using sorted values
for n in range(housing.shape[1]):
    print(f'{housing.columns[n]}: {np.sort(housing.iloc[:,n])}')
    plt.plot(np.sort(housing.iloc[:,n]),'-o')
    plt.show()

In [None]:
# Visualise using histograms
dummy = housing.hist(bins=40,figsize=(15,10))

In [None]:
# Alternative using matplotlib
plt.figure(figsize=(15,12))
nfeat = housing.shape[1]
ncol = 3
nrow = int(np.ceil(nfeat/ncol))
for n in range(nfeat):
    plt.subplot(nrow,ncol,n+1)
    plt.hist(housing.iloc[:,n],bins=40)
    plt.title(f'Feature name: {housing.columns[n]}')
plt.show()

Now fix various problems with features, as identified from the above visualisations and descriptive tables.

**In this step we re-check for any invalid data and convert it to NaN. This is redundant, since those values were already converted to NaN before visualisation, but it adds another layer of scrutiny.**


In [None]:
# Fix features ... chosen by examining the plots and descriptions above
# Unfinished parts in this cell and later are indicated with question marks
# Re-check for invalid data and convert it to NaN (redundant after the earlier conversion, but an extra check)

col = housing.columns[3]  # total_rooms
bad_vals = housing[housing[col] == "??"].index  # Index labels of rows with "??" in total_rooms
housing.loc[bad_vals, col] = np.nan  # Use loc, not iloc: these are labels, which differ from positions after dropna
print(bad_vals)  # Index labels of rows where "??" was found (empty if the column is already numeric)
print(housing.loc[bad_vals])  # Check the affected rows

housing.describe()
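
One point worth illustrating about the cell above: `.index` returns row labels, while `iloc` expects integer positions, and the two stop matching once rows have been dropped. A minimal sketch on hypothetical toy data (the `toy` frame is invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": [10.0, 20.0, 30.0, 40.0]})
toy = toy.dropna()                      # row label 1 is removed; labels 0, 2, 3 remain
labels = toy[toy["b"] == 40.0].index    # index label 3, which now sits at position 2

try:
    toy.iloc[list(labels), 0]           # treats 3 as a position in a 3-row frame
    iloc_failed = False
except IndexError:
    iloc_failed = True

toy.loc[labels, "a"] = np.nan           # label-based assignment reaches the intended row
print(iloc_failed)
print(toy)
```

Using `loc` with labels, or calling `reset_index(drop=True)` after `dropna`, avoids the mismatch.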


In [None]:
# Take a copy of the dataframe
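# DataFrame.copy() copies the data and index by default (deep=True), so later changes to
# housing do not affect housing_copy. Python objects stored inside object-dtype columns are
# not copied recursively, but that does not matter here as the columns are numeric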
housing_copy = housing.copy()

## Question 7. Imputation Process

In [None]:
# Estimate medians now that data is tidied up (though it would not change much)
# axis=0 computes the median of each column (feature), axis=1 of each row; here we need column-wise medians
medians = np.nanmedian(housing, axis=0)
print(medians)
housing.info()
print(housing.shape)
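
The effect of the `axis` argument, and the way `np.nanmedian` skips NaN values, can be checked on a small array. This is a sketch with made-up numbers:

```python
import numpy as np

data = np.array([[1.0, np.nan],
                 [3.0, 4.0],
                 [5.0, 6.0]])

col_medians = np.nanmedian(data, axis=0)  # one median per column, NaN ignored
row_medians = np.nanmedian(data, axis=1)  # one median per row, NaN ignored

print(col_medians)  # [3. 5.]
print(row_medians)  # [1.  3.5 5.5]
```

The second column has only two valid values (4 and 6), so its median is 5.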

In [None]:
# Perform imputation using median values (it is critical that this is done _after_ tidying data up)
for n in range(housing.shape[1]):
    housing.iloc[:,n] = housing.iloc[:,n].fillna(medians[n])
print(housing.describe())
housing.info()  # Every column now has the same non-null count as the number of rows (20640 if no rows were dropped), which indicates that all missing values were filled with the median of the respective feature
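
The column-by-column loop can also be written as a single pandas call. This is a sketch on hypothetical toy data, assuming all columns are numeric. It is an equivalent idiom, not the method the workshop prescribes.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                    "b": [np.nan, 4.0, 8.0]})

# DataFrame.median() skips NaN by default and returns one value per column;
# fillna() then matches those values to columns by name
filled = toy.fillna(toy.median())
print(filled)
```

Here the missing value in `a` becomes 2.0 and the one in `b` becomes 6.0.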

In [None]:
# An alternative way to do imputation (note that this uses the mean rather than the median)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
housing.info()
print(np.sum(np.isnan(housing),axis=0))
print(np.sum(np.isnan(housing.to_numpy()),axis=0))
housing_np = imputer.fit_transform(housing)
print(np.sum(np.isnan(housing_np),axis=0))

## Question 5. Splitting the data into training and test sets with an 80/20 split

In [None]:
import sklearn.model_selection

Ndata = housing.to_numpy()
X_all = Ndata[:,:-1]
y_all = Ndata[:,-1]
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X_all, y_all, test_size=0.2)
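
To make the idea of a shuffled 80/20 split concrete, here is a minimal NumPy-only sketch on toy data. It illustrates the principle and is not Scikit-Learn's implementation. The names `X_demo`, `y_demo` and the seed are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the split is repeatable
X_demo = np.arange(20).reshape(10, 2)    # 10 toy samples, 2 features
y_demo = X_demo[:, 0] // 2               # target equals the row number, so alignment can be checked

order = rng.permutation(len(X_demo))     # shuffled row indices
n_test = int(round(0.2 * len(X_demo)))   # 20% of the rows go to the test set
test_idx, train_idx = order[:n_test], order[n_test:]

X_tr, X_te = X_demo[train_idx], X_demo[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

`train_test_split` itself accepts a `random_state` argument, which makes the split in the cell above repeatable between runs.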

## Question 6. Training a linear regression model

In [None]:
model = sklearn.linear_model.LinearRegression()
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
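plt.scatter(y_test,y_pred)
plt.plot(y_test,y_test,'k')  # reference line y = x: perfect predictions would lie on it
plt.xlabel('True value')
plt.ylabel('Predicted value')
plt.show()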

## Question 8. The Importance of Handling Missing Data

In [None]:
# Code to see how many good rows are left if a small amount of data corruption occurs with multiple features
ns = 400   # don’t use np for number of people, as it hides the np that stands for numpy!
nf = 100   # number of features
errate = 0.008   # error rate
vals = np.random.rand(ns,nf)   # uniform random numbers in [0,1)
errs = vals<errate  # True with probability errate, marking a corrupted value
nerrs = np.sum(errs,axis=1)   # number of errors per subject
print(np.sum(nerrs==0)/ns*100)   # percentage of subjects with complete records (no errors)
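
The simulated percentage can be compared with a simple calculation. Assuming errors are independent, the probability that a record with `nf` features has no errors is `(1 - errate) ** nf`. The sketch below computes this and checks it against a larger simulation. `ns_big` and the seed are assumptions made for this example.

```python
import numpy as np

nf = 100          # number of features, as in the cell above
errate = 0.008    # per-value error rate, as in the cell above

# Probability that all nf values in one record are error-free (independent errors assumed)
p_complete = (1 - errate) ** nf
print(f"{p_complete * 100:.1f}% of records expected to be complete")  # 44.8%

# Simulation with more records than above, to reduce sampling noise
rng = np.random.default_rng(0)
ns_big = 20_000
errs_big = rng.random((ns_big, nf)) < errate
simulated = np.mean(errs_big.sum(axis=1) == 0)
print(f"{simulated * 100:.1f}% complete in simulation")
```

So an error rate below 1% per value still leaves fewer than half of the records complete when there are 100 features, which is why simply dropping incomplete rows can be costly.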