#**Introduction**

This lab exercise serves to introduce the concepts of Supervised Machine Learning, Decision Trees, and various forms of model evaluation we can use to analyze and improve the performance of machine learning models.

The fundamental goal of this lab is to use the Random Forest (RF) model from the TensorFlow Decision Forests library to analyze a given set of data and predict a certain value on an unseen test set. We will use this Random Forest model to read in several attributes of different passengers on board a ship, and use these data fields to predict whether or not the passengers were transported to a different location. This particular model is well suited to labeled, tabular data, which is what our provided data sets contain.

The machine learning process begins with analyzing and preprocessing our provided data to ensure it is in a valid format for the Random Forest model we will instantiate. We will then split our training data into training and validation sets so the model can be evaluated on several metrics, including accuracy, precision and recall. Next, we "fit" our model to the training set and use the trained model to predict values within the validation and test sets. This cycle of analysis and performance evaluation continues, as we further process our data to recognize any particularly interesting features and patterns, until we reach a trained model that predicts whether or not an unseen passenger was transported with acceptable accuracy.




#**Data Exploration**
The initial machine learning process often begins with exploring the data we have collected to gain an intuition on any existing patterns, observe any missing data fields, and see what types of data we are working with. We begin with our train.csv file, containing the data that was partitioned to train the model, and initially see there are 14 attributes within our provided training data. These fields, some of which are listed below, represent attributes for each passenger we are provided.
- `Name`
- `Age`
- `PassengerId` - gggg_pp, where gggg=group and pp=position - composite
- `HomePlanet` - Where the Passenger is from
- `Destination` - Where the Passenger is going
- `Cabin` - /Deck/Cabin_num/Side - composite
- `Transported` - Boolean value, what we are trying to predict in the test set

We also see that this `Transported` attribute is not present when we inspect `test.csv`, the data partitioned to test the model's predictive ability, as we are attempting to use the passenger data to predict whether the passengers in the test set were transported to an alternate location.
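
A first look at the data can be sketched with `pandas` as follows. This is only an illustration: the three rows in `sample_csv` are made up to resemble the shape of `train.csv` and are not taken from the real file.

```python
import io
import pandas as pd

# Tiny made-up sample shaped like train.csv (values are illustrative only)
sample_csv = io.StringIO(
    "PassengerId,HomePlanet,Cabin,Age,Transported\n"
    "0001_01,Europa,B/0/P,39,False\n"
    "0002_01,Earth,,24,True\n"
    "0003_01,,A/0/S,,False\n"
)
sample_df = pd.read_csv(sample_csv)

# Number of rows and columns, the type pandas inferred for each column,
# and how many values are missing in each column
print(sample_df.shape)
print(sample_df.dtypes)
missing_per_column = sample_df.isnull().sum()
print(missing_per_column)
```

Running the same three calls on the real `train.csv` gives a quick overview of the attributes and of where values are missing.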

##**Preprocessing the Data**
Before we read in our data and fit the model to the training set, we must process the data so it is in a readable format for the Random Forest model. Within the field of machine learning, and specifically the practice of collecting and recording data, errors become increasingly common as the size and complexity of the data grows. Given that our dataset is around 12,000 rows with 14 columns per row, the number of individual values is quite large, and some of them are missing. The RF model is able to handle missing values within categorical attributes, such as the `Name` and `HomePlanet` attributes, but it is not as well suited to missing integer or boolean values. Therefore, our first step in preprocessing is to convert each boolean to an integer value, and then handle any missing integer values within our dataset.

```python
import csv

#open the raw training data and the file that will hold the processed output
csvFile = open('train.csv', 'r', newline='')
outFile = open('trainProcessed.csv', 'w', newline='')
reader = csv.DictReader(csvFile, delimiter=',')

#list of dictionaries which store each row of data from CSV input file
dictList = []

#create a dictionary for each row in our CSV and add it to our list of dictionaries
for row in reader:
    dictList.append(row)

#the input file has been fully read at this point, so it can be closed
csvFile.close()

```

The code above opens our `train.csv` file, which contains all of the unprocessed original training data, and the `trainProcessed.csv` file, which will hold the output of our first iteration of changes to the data. We then use `csv.DictReader` from the `csv` library to create `reader`, an iterator that yields each row of the CSV as a dictionary keyed by column name.

We then iterate through each row from the `train.csv` file within our reader data structure and append each row to the `dictList` list we declared to hold each row of data. This allows us to access each individual row/passenger within the set and access any boolean or missing values directly. It also allows us to create new attributes for each passenger when we move to decomposing any composite attributes. Now that we have stored each individual passenger and their data, we can move to processing the data in order to feed it into our model.
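
As a small illustration of what `csv.DictReader` produces, the sketch below reads two made-up rows from an in-memory string instead of a file. Every value arrives as a string and a missing field arrives as an empty string, which is why the later processing compares against `"True"`, `"False"` and `""`.

```python
import csv
import io

# Two made-up rows standing in for train.csv
sample = io.StringIO(
    "PassengerId,CryoSleep,Age\n"
    "0001_01,False,39.0\n"
    "0002_01,True,\n"
)

# Each row becomes a dictionary keyed by the header names
rows = list(csv.DictReader(sample))
print(rows[0])
print(repr(rows[1]["Age"]))  # the missing age is an empty string
```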

###**Boolean Conversion & Attribute Decomposition**

```python
#add new keys needed to split the composite cabin and passengerid columns
for item in dictList:
    item["Deck"] = " "
    item["Cabin_num"] = " "
    item["Side"] = " "
    item["Group"] = " "
    item["Position"] = " "

#iterate through each key/item for each dictionary in our list
for item in dictList:
    for key in item:
        #convert boolean values to integers - helps the random forest model
        if item[key]=="True":
            item[key]=1
        elif item[key]=="False":
            item[key]=0
            #place holder for missing data
        elif item[key]=="":
            item[key]="NaN"
            #when we find the cabin information, split by the / character to get each deck, cabin_num and side value
            #assign those three values to our new keys - helps random forest model
        elif key == "Cabin":
            cabinList = item["Cabin"].split('/')
            item["Deck"] = cabinList[0]
            item["Cabin_num"] = cabinList[1]
            item["Side"] = cabinList[2]
            #split the passenger id into group and position in the same fashion
        elif key == "PassengerId":
            IDList = item["PassengerId"].split('_')
            item["Group"] = IDList[0]
            item["Position"] = IDList[1]

    #remove the redundant composite keys now that they have been split
    del item["Cabin"]
    del item["PassengerId"]

#create an instance of a CSV dictionary writer
#write our keys as the header/first row, then write the values for each dictionary on the following lines
#produces the processed data into a CSV output file
writer = csv.DictWriter(outFile, fieldnames=dictList[0].keys())
writer.writeheader()
for item in dictList:
    writer.writerow(item)

#close the file so every row is flushed to disk before downloading
outFile.close()
#download the csv that we have preprocessed (files comes from google.colab)
#this file is uploaded again later for the interpolation step
files.download("trainProcessed.csv")

```

From our initial insights into the data set, we recognize there are two main composite attributes for each passenger, namely `PassengerId` and `Cabin`. Composite attributes often make it easier to measure and record the data, but can also lead to skewed results during model training and evaluation as information can be lost when we combine data into a single field. To avoid losing any potential patterns and detail, we chose to break up these composite fields as follows.

 - `PassengerId` -> `Group`, `Position`
 - `Cabin` -> `Deck`, `Cabin_num`, `Side`

We first iterate through each `item` in our list of data, with each item being a separate dictionary for each passenger. We then add these new attributes to each item to prepare for the decomposition of the composite attributes.

We then use a nested loop to iterate through each `item` (passenger) in our `dictList`, and then each `key` (individual field) in that `item`. Within this nested loop we convert any boolean values to integers, replace any missing values with the placeholder `NaN` so they are recognized as missing in later steps, and split our two composite attributes into their respective singular fields. Each "True" value becomes 1 and each "False" value becomes 0. This boolean conversion allows the model to better capture the patterns between the boolean fields and other fields, and allows us to use interpolation later to handle any missing boolean/integer values. We then use the `split()` method to split `PassengerId` and `Cabin` by the `_` and `/` characters respectively.

After converting each boolean field to an integer, decomposing any composite attributes, and making each missing value `NaN`, we delete the two composite attributes to avoid any redundant information within our model. We then use the `DictWriter` from the `csv` library to write each passenger dictionary in our list of data to the `trainProcessed.csv` output file, and use the `files.download()` function to download this initial processed version of our data. This allows us to keep track of each step in our data examination process on our own system and then upload these files later when we move to data interpolation and training our model.
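
The same decomposition can also be expressed with `pandas` string methods. The sketch below is an alternative to the dictionary-based loop, not the code used in this lab, and the three passengers in `df` are made up for illustration.

```python
import pandas as pd

# Made-up passengers; None marks a missing value
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_02"],
    "Cabin": ["B/0/P", None, "A/12/S"],
    "CryoSleep": [True, False, None],
})

# Split each composite column into its parts; a missing Cabin stays missing in all three parts
df[["Group", "Position"]] = df["PassengerId"].str.split("_", expand=True)
df[["Deck", "Cabin_num", "Side"]] = df["Cabin"].str.split("/", expand=True)

# Map booleans to 1/0; values not in the mapping (the missing one) become NaN
df["CryoSleep"] = df["CryoSleep"].map({True: 1, False: 0})

# Drop the composite columns now that their parts are stored separately
df = df.drop(columns=["PassengerId", "Cabin"])
print(df)
```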

###**Data Imputation & Examining Missing Values**

Different models within machine learning use various methods to solve the problem of missing data depending on the frequency of missing values, size of the data set, and any existing patterns/dependencies between missing fields. Most missing value solutions are classified as either deletion, where one deletes the missing values as a whole, or imputation, where one uses the existing data set to make inferences about the missing values. To decide which route to take, we need to gather metrics about the frequency of any missing values within the set and find any patterns between them.

```python
import pandas as pd
import missingno as msno

#function to print out the data for any missing values we have within input
def getMissingData(df):
        # Total missing values
        mis_val = df.isnull().sum()

        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)

        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})

        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")

        # Return the dataframe with missing information
        return mis_val_table_ren_columns

```
The `getMissingData()` function displayed above takes a `pandas` DataFrame, which will be our training set once we read in the `train.csv` file, and returns a table with the count and percentage of missing values for each attribute that has any, while printing a short summary of how many columns are affected. The function results, displayed in Fig. 1 (attached folder), indicate there are a few thousand missing values in total, evenly spread out over 12 of the 14 attributes in the `train.csv` file. Given this large number of missing values, we chose to use data imputation to handle the missing data rather than deletion, as deletion would lead to too great a loss of information, which could negatively impact the model's performance.

Now that we had chosen data imputation to handle the missing data, we needed to choose a specific method of imputation. Three popular forms of imputation are front fill, back fill, and interpolation. Front fill replaces a missing value with the value directly before it, and back fill replaces it with the value directly after it. Interpolation, on the other hand, estimates each missing value from the surrounding known values in the same column rather than copying a single neighbor. To decide which method to use, we employed the `missingno` library to measure certain dependencies within each data field. This library allows us to generate various graphs that depict the dependency and patterns between the missing values of each attribute, where low dependency indicates a greater need for interpolation, as we need to estimate the missing values rather than just fill them in with the nearest data fields. Using the `matrix()`, `heatmap()` and `dendrogram()` functions from `missingno`, we can see the lack of strong dependency between any two data fields and very little correlation within each attribute. These results, shown in Fig. 2, Fig. 3, and Fig. 4, led us to choose data interpolation to handle our missing values.
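
The difference between the three imputation methods is easiest to see on a small example. The sketch below uses a made-up `ages` series with two consecutive gaps and `pandas`' default (linear) interpolation.

```python
import numpy as np
import pandas as pd

# Made-up column with two missing values between 20 and 50
ages = pd.Series([20.0, np.nan, np.nan, 50.0])

forward_filled = ages.ffill()       # copies the previous known value: 20, 20, 20, 50
backward_filled = ages.bfill()      # copies the next known value:     20, 50, 50, 50
interpolated = ages.interpolate()   # linear between neighbors:        20, 30, 40, 50

print(forward_filled.tolist())
print(backward_filled.tolist())
print(interpolated.tolist())
```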

```python

files.upload() #upload trainProcessed.csv to interpolate the missing data
trainSet = pd.read_csv("trainProcessed.csv")
#use interpolation to predict the missing values
#still leaves some categorical data missing, but handles most of missing integer/bool val
trainSet.interpolate(limit_direction="both", inplace=True)
#write the interpolated trainSet to a csv file
#index=False keeps the row index from being saved as an extra unnamed column
trainSet.to_csv("trainFull.csv", index=False)
#download the processed csv file - all the missing integer/boolean data has now been predicted through interpolation
files.download("trainFull.csv")

```

The code above uploads and reads the output file from earlier, which contains the converted boolean-to-integer values, `NaN` values, and the decomposed `PassengerId` and `Cabin` attributes. We use `pandas` again to read in this CSV file with `pd.read_csv()`, and then call the resulting DataFrame's `interpolate()` method. With its default linear method, this estimates each `NaN` in the numeric columns from the neighboring known values in that column, and `limit_direction="both"` also fills gaps at the start and end of a column. Some categorical values remain missing after this step, but most of the missing integer/boolean values are filled. This completes the preprocessing steps for our data, allowing us to use this `trainFull.csv` file to train our Random Forest model.



###**Reading In the Training Set & Data Examination**
We know from initial observations that the training data is roughly 2/3 of our total data, meaning we have a ~67/33 split between our training and testing data. We continue the machine learning process by reading in our processed training data to begin the training process. To read the data from the CSV format into a structure our model can accept, we rely on several imports from Python libraries.

```python
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import time
#keras is imported once, through tensorflow; a second plain `import keras` would rebind the same name
from tensorflow import keras
from google.colab import files
from sklearn.model_selection import train_test_split
import numpy as np
import csv

```
By using the `pandas`, `tensorflow`, `csv` and `keras` libraries, we are able to read in our CSV-formatted data and convert it into structures that the TensorFlow Random Forest model can accept. We also import the `files` module from `google.colab`, allowing us to upload and download specified files at run time. This lets us upload our data into temporary directories within Google Colab before reading it in using the `pandas` library.

```python
print("upload trainFull.csv")
files.upload()
dataset_df = pd.read_csv("trainFull.csv")
print(dataset_df.describe(include = "all"))

```
We are able to upload the fully processed `trainFull.csv` file and read it into our `dataset_df` object using the `pandas` library. Before we use this data to train the model, we want to further examine it and find any more underlying patterns or features within the data. We can use the `describe` function to print out various statistics and metrics about each attribute, including the mean, max, min, frequency, and any missing values for each column. We also are able to determine the distribution of various attributes, including the `Transported` field, to gain insights into what our potential predictions will look like. We discovered there is roughly a 50/50 split (4378 true values and 4315 false values) between those that were transported and those that were not, meaning our predictions should most likely resemble this 50/50 pattern.
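
The class balance quoted above can be checked with `value_counts()`. The sketch below rebuilds a label column from the two counts reported in this section (4378 transported, 4315 not) rather than reading the real file, so the variable `labels` is a stand-in for `dataset_df["Transported"]`.

```python
import pandas as pd

# Stand-in for dataset_df["Transported"], built from the counts reported above
labels = pd.Series([1] * 4378 + [0] * 4315, name="Transported")

# Share of each class in the training data
class_share = labels.value_counts(normalize=True)
print(class_share.round(3))
```

Because the classes are almost perfectly balanced, a model that always predicted "transported" would already score about 50% accuracy, which is a useful baseline when reading the evaluation results later.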

#**Dataset Splitting**

Now that we have fully preprocessed our data and read in the processed training data, we move to splitting this data into training and validation sets. The training set is often ~80% of the available data and is used to train and fit the model, with the remaining ~20% reserved to evaluate the model on data it has not seen. This allows us to see how the model is performing and how accurate its predictions are compared to the true values that are present in the validation data.

```python
#split the training set into 80% to train and 20% to validate on
trainSet, validSet = train_test_split(dataset_df, test_size=0.2)

#convert the split pandas sets to tensorflow datasets so we can train the Random Forest model with them
trainSetTF = tfdf.keras.pd_dataframe_to_tf_dataset(trainSet, label="Transported")
validSetTF = tfdf.keras.pd_dataframe_to_tf_dataset(validSet, label = "Transported")

```

As depicted above, we use the imported `train_test_split()` function to split the `dataset_df` DataFrame into 80% for training and 20% for validation. We then convert the `trainSet` and `validSet` DataFrames into TensorFlow datasets, as our Random Forest model is trained on TensorFlow datasets. We use the function `tfdf.keras.pd_dataframe_to_tf_dataset()`, passing our train and validation sets and specifying the label as `Transported`. This tells the model that `Transported` is the value it will try to predict and be evaluated on.
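
One optional refinement, which was not used in this lab, is to make the split reproducible and to keep the class balance identical in both parts. The sketch below shows the `random_state` and `stratify` arguments of `train_test_split()` on a made-up `toy_df`; the seed value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 passengers, half of them transported
toy_df = pd.DataFrame({"Age": range(100), "Transported": [0, 1] * 50})

# random_state fixes the shuffle so reruns give the same split;
# stratify keeps the share of each label the same in both parts
train_part, valid_part = train_test_split(
    toy_df,
    test_size=0.2,
    random_state=42,
    stratify=toy_df["Transported"],
)

print(len(train_part), len(valid_part))
print(valid_part["Transported"].mean())
```

A fixed seed makes it easier to tell whether a change in the metrics comes from a change to the model or data, rather than from a different random split.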

#**Model Selection & Training**

Now that we have split our processed data into training and validation sets and converted them into TensorFlow datasets, we can move to creating and training our model. We chose the Random Forest (RF) model from the `tensorflow_decision_forests` library. The RF model is made up of multiple individual decision trees that together attempt to answer a single question through the analysis of attributes found within the training data. Because we are using this model to answer a single question, "Was this passenger transported?", it can be applied to this task rather easily. Decision trees answer the main question by constructing nodes that answer smaller questions on the way to the overall answer. For example, one might use a decision tree to ask "Will it rain?", and the tree might be constructed of nodes that ask "Is it cloudy?", "Is the temperature less than 60 degrees?", etc. These yes/no questions serve as the nodes of the tree, and the model makes a decision at each one in order to arrive at a conclusion. The Random Forest model uses many of these decision trees at once, all working to answer the single question "Was the passenger transported?". Combining the answers of many trees reduces the chance of overfitting to the training data set compared with a single tree, which leads to improved performance. The main drawback of the RF model is its higher space and time cost, since many trees must be built and stored, but because our dataset is not too large and we are trying to maximize accuracy as opposed to efficiency, the RF model is well suited for this problem.
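
The idea of many trees answering one question can be illustrated with a toy example. The three "trees" below are hand-written rules with made-up thresholds, not trees learned from the data, and simple majority voting is assumed as the way their answers are combined.

```python
from collections import Counter

# Three hand-written stand-ins for decision trees (rules and thresholds are made up)
def tree_a(passenger):
    return passenger["CryoSleep"] == 1

def tree_b(passenger):
    return passenger["Spa"] < 100

def tree_c(passenger):
    return passenger["Deck"] in {"B", "C"}

def forest_predict(passenger, trees):
    """Return the answer given by the majority of the trees."""
    votes = Counter(tree(passenger) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
sleeper = {"CryoSleep": 1, "Spa": 500.0, "Deck": "B"}   # two of three trees vote True
spender = {"CryoSleep": 0, "Spa": 500.0, "Deck": "F"}   # all three trees vote False

print(forest_predict(sleeper, trees))
print(forest_predict(spender, trees))
```

A single wrong tree (here `tree_b` for the first passenger) is outvoted by the others, which is the intuition behind the forest being more robust than any one tree.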

After choosing the RF model, we moved to compiling and training, or "fitting" the model to the processed training data set. We created an instance of the model and used the converted tensorflow dataset `trainSetTF` as our input as shown below.

```python
#create instance of the RF decision tree model - supervised learning since we're using labels for each data attribute
model = tfdf.keras.RandomForestModel()

#configure the model before training, set the metrics we want to measure
model.compile(metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()])
#train the model using the converted training data from the trainFull.csv
model.fit(trainSetTF)
```
Using the `compile()` function, we declare the main metrics we want to use to measure the performance of our RF model: accuracy, precision and recall. Accuracy is the fraction of all predictions that are correct, precision is the fraction of positive predictions that are actually positive, and recall is the fraction of actual positives that the model identifies correctly. Together, these give us a more complete system of evaluation than accuracy alone.
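
The three metrics can be written out directly from the four outcomes of a binary prediction. The helper `classification_metrics` and the confusion counts passed to it are made up for illustration and are not results from this lab.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision and recall from confusion-matrix counts.

    tp/fp: passengers predicted as transported, correctly/incorrectly
    tn/fn: passengers predicted as not transported, correctly/incorrectly
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Made-up counts for 200 validation passengers
accuracy, precision, recall = classification_metrics(tp=80, fp=20, tn=70, fn=30)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```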

After compiling the model, we use the `fit()` function to train it, passing in the reserved training set `trainSetTF`. This leaves us with a trained model that we can use to predict unseen data. We first pass it our validation set, and then our test set, to measure its performance.


#**Model Evaluation**
Now that we have created an instance of the RF model and trained it on the reserved training data, we pass it the validation set using the `predict()` and `evaluate()` functions.

```python
predictions = model.predict(validSetTF)
i=1
for prediction in predictions:
  roundedVal = prediction.round()
  if roundedVal==1:
    predictedBool = True
  else:
    predictedBool=False
  #print(f'{i} raw value is {prediction}, rounded val is {roundedVal}, prediction is {predictedBool}')
  i+=1
#print out the declared metrics to evaluate accuracy/performance
model.evaluate(validSetTF)
model.summary()
```
We pass the validation set, the remaining 20% of the training data, to the `predict()` function, which returns one float value between 0 and 1 for each passenger. Each value is the model's prediction for that passenger's `Transported` attribute. We can then iterate through these predictions, rounding each float value to the boolean prediction we need to match the format of the `Transported` attribute. This allows us to print out each prediction, and the `evaluate()` function then compares the predictions against the actual values in the validation set and reports the performance for each declared metric.
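
The conversion from float predictions to booleans can also be done in one vectorized step. The sketch below assumes the predictions arrive as a NumPy array with one column, and `raw_predictions` holds made-up values rather than real model output. It also shows one subtle difference: NumPy rounds a value of exactly 0.5 down to 0, while an explicit `>= 0.5` threshold counts it as `True`.

```python
import numpy as np

# Made-up predictions shaped like one float per passenger
raw_predictions = np.array([[0.12], [0.5], [0.87]])

# Explicit threshold: anything at or above 0.5 counts as transported
predicted_bools = raw_predictions[:, 0] >= 0.5

# Rounding, as in the loop above; NumPy rounds exact halves to the nearest even number
rounded = raw_predictions[:, 0].round()

print(predicted_bools.tolist())
print(rounded.tolist())
```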

##**Model Performance**

Using the `evaluate()` function, we pass the `validSetTF` validation set and print out the model's accuracy, precision and recall. The first iteration of training our model produced fairly subpar performance, with each of our 3 metrics measuring in at around 73%. We then adjusted the ratio of training to validation data by setting the `test_size` argument of `train_test_split()` to 20%, whereas before it was 33%. We originally wanted to mirror the ratio between the provided `train.csv` and `test.csv`, but increasing the share of training data to 80% produced an increase of nearly 7 percentage points in our accuracy and precision and 5 percentage points in our recall. The larger amount of training data meant the model could better learn underlying patterns within the data, allowing it to evaluate the validation set with improved accuracy.

We then continued to retrain the model in an attempt to increase our performance even further, but the model was not able to exceed 82% accuracy and we wanted to avoid any potential overfitting to the training set.


###**Model Summary - OOB Data**

After reaching a point where the model would not improve in any of our 3 metrics, we wanted to view the model's performance on Out-of-Bag (OOB) data to gain insight on the generalization ability of the model. Machine learning models often perform very well on training data, but we often care about the model's applicability to data it has never seen before. Using OOB data allows us to produce a score on how well the model can be generalized to unseen data.

Using the `summary()` function, we can print several valuable pieces of information about our model, including the OOB score at each step in training. As the RF model fits itself to the training data, it keeps adding decision trees to the ensemble. The `summary()` function displays the OOB score in terms of accuracy as the forest grows, starting with 1 tree and ending with 300, as shown in Fig. 5. The OOB score remained relatively consistent at ~79% throughout, mirroring the accuracy of the model as a whole.
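
OOB data exists because each tree in a random forest is typically trained on a bootstrap sample, that is, rows drawn at random with replacement, so some rows are never drawn for a given tree. The simulation below is a sketch of that sampling step only, with a made-up row count and seed, and shows that roughly a third of the rows are left out of each sample.

```python
import random

random.seed(0)      # arbitrary seed so the run is repeatable
n_rows = 10_000     # made-up number of training rows

# Draw n_rows row indices with replacement, as one tree's bootstrap sample
bootstrap_sample = [random.randrange(n_rows) for _ in range(n_rows)]

# Rows that were never drawn are "out of bag" for this tree
out_of_bag_share = 1 - len(set(bootstrap_sample)) / n_rows
print(f"{out_of_bag_share:.3f}")  # close to 1/e, about 0.368
```

Each row can therefore be scored by the trees that never saw it, which gives an estimate of performance on unseen data without setting aside a separate validation set.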


#**Feature Importance Analysis**

Now that we have trained and evaluated our RF model using the training and validation sets, respectively, we can use the same `summary()` function to evaluate which attributes/features within the data were most influential in the results of the model. This function prints out the level of influence that each feature or "variable" has on the model's ability to predict and learn from the data. The function takes into account the number of nodes associated with each attribute and the depth of various subtrees for each attribute, and generates an overall `SUM_SCORE` value which ranks each attribute. As depicted in Fig. 6, the most influential feature on the model was the `Cabin_num` value, part of the description of where each passenger stayed on the ship. The model's summary also named `CryoSleep`, `Spa`, `Group` and `Deck` as the next most influential features. This indicates that where a passenger stayed on the ship, and whether they remained in CryoSleep for the duration of the journey, were strongly related to the predicted `Transported` value. Note that an importance score measures how useful a feature was to the model, not whether the relationship is positive or negative. This feature analysis can be very valuable in future model development and in prioritizing certain attributes in future training iterations.

#**Test Set Predictions**

After training and evaluating the RF model on the training data, we now move to finally using the model to predict the `Transported` values for each passenger within our `test.csv` file.

```python
print("upload preTest.csv")
files.upload()  
testSet = pd.read_csv("preTest.csv")
#convert the preTest csv with the added Transported column into a tf dataset for the model to evaluate/predict
test_data = tfdf.keras.pd_dataframe_to_tf_dataset(testSet, label="Transported")
```

We repeat the steps above to read in `preTest.csv`, which holds the fully processed testing data (processed as above to convert booleans to integers, handle missing values, etc., with a `Transported` column added so the same label argument can be used), and convert the pandas DataFrame to a TensorFlow dataset so our RF model can accept it as input. We then use the `predict()` function, passing this `test_data` variable to get the trained model's prediction for each passenger in the test set.

```python
#Begin to test model with testing set to predict the transported values
testPreds = model.predict(test_data)

#list to hold each true/false transported prediction for each user
predictionsList = []
count=1

#iterate through each prediction from the predict() function to print out true/false value for each passenger
for pred in testPreds:
  roundedValTest = pred.round()
  #convert the model's integer predictions into boolean values
  if roundedValTest==1:
    predictedBoolTest = True
  else:
    predictedBoolTest= False
  #print(f'raw value {count} is {pred}, rounded val is {roundedValTest}, prediction is {predictedBoolTest}')
  #add each predicted true/false transported value to a list to store all of the predictions
  predictionsList.append(predictedBoolTest)
  count+=1

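#note: the Transported column in preTest.csv was added during preprocessing and does not hold real labels,
#so the metrics printed here do not measure true test performance
model.evaluate(test_data)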
```
After gathering the predictions for each test set passenger, we repeat the steps to convert each prediction to a boolean value, and store each boolean value in our `predictionsList` list so we can pair each prediction with the corresponding `PassengerId`.

```python
testCSV = pd.read_csv("test.csv")

#use a list to store each dict pair of {passId, transportedPrediction}
resultsList = []
passIDCount=0
#iterate through each prediction in our list of transported predictions
#create a dict mapping for each PassengerId and the passenger's prediction
#add this dict to our list of results, repeat
for boolPred in predictionsList:
  results = {}
  results["PassengerId"] = testCSV.PassengerId[passIDCount]
  results["Transported"] = boolPred
  resultsList.append(results)
  passIDCount+=1

#write the final list of predictions to a submission csv file, recording each PassengerId and their predicted true/false value
#the with block closes the file so every row is flushed to disk before downloading
with open("sample_submission_format.csv", 'w', newline='') as finalPredsFile:
  finalWriter = csv.DictWriter(finalPredsFile, fieldnames=["PassengerId", "Transported"])
  finalWriter.writeheader()
  for pair in resultsList:
    finalWriter.writerow(pair)

files.download("sample_submission_format.csv")
```
After storing each boolean prediction for each test set passenger, we read in the data from our original `test.csv` to get access to the `PassengerId` of each passenger. Our final submission CSV file holds the `PassengerId` and `Transported` columns, so we store both values in a dictionary for each passenger. We iterate through the `predictionsList`, and for each prediction we create a `results` dictionary. We then assign the `PassengerId`, looked up using the index variable `passIDCount`, and the current predicted boolean value within our loop. Once each dictionary is filled, we append it to our `resultsList`. We then use the `DictWriter` class from the `csv` library to write each passenger pair to the final output file `sample_submission_format.csv` and download the file to our local system.
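
The pairing of IDs and predictions can also be written with a `pandas` DataFrame. The sketch below is an alternative to the dictionary loop, not the code used in this lab. The IDs and predictions are made up, and the output goes to an in-memory buffer so the example runs without any files.

```python
import io
import pandas as pd

# Made-up stand-ins for testCSV.PassengerId and predictionsList
passenger_ids = ["0013_01", "0018_01", "0019_01"]
predictions_list = [True, False, True]

# One row per passenger, in the same order as the predictions
submission = pd.DataFrame(
    {"PassengerId": passenger_ids, "Transported": predictions_list}
)

# index=False writes only the two submission columns
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name such as `"sample_submission_format.csv"` to `to_csv()` instead of `buffer` would write the same content to disk.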


#**Conclusions**
Using the Random Forest, we were able to construct, train, evaluate and implement a model to learn from training data and employ this learning ability to predict values on unseen test data. The model was able to predict values on our validation set with roughly 80% accuracy, precision and recall, and was able to successfully predict a single value for thousands of passengers within our testing data set. Although the model was fairly average in terms of performance, the Random Forest model serves as a solid foundation for a machine learning approach to this problem. We were able to use various evaluation and summarizing functions to determine the final structure of our Random Forest decision tree ensemble, measure the model's ability to generalize on unseen data using OOB predictions, and measure the influence individual attributes had on the overall model. This knowledge and general base level of accuracy combine to provide an overall successful model with respect to creating a predictive system to learn from training data and use this learning to predict values on a test set.

###**Challenges**
The main challenges we faced during the development of our model were in the data processing step and in finding the most appropriate form of evaluation for the model. Researching and investigating the data to determine the best method to handle missing values, and understanding both the structure of our provided data and the formats required by the `pandas`, TensorFlow and Keras data structures within our code, was difficult at first. It was also difficult to determine the best ways to evaluate the model with respect to its performance on OOB data and to feature importance. However, the functionality provided by the `csv`, `pandas` and `tfdf` libraries, along with the functions that accompany our instantiated `model` variable, helped us solve these challenges.


###**Future Analysis**

Future work on this model should focus on using the measured feature importance to emphasize different attributes during training. We were able to determine the significance of each attribute to the model's final predictions, so giving greater weight to the most influential attributes could help the model adapt to this specific problem and data format.
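
One simple way to act on feature importance is to retrain on only the highest-ranked attributes and compare OOB accuracy. The sketch below is a stand-in, not the approach used in this lab: it swaps in scikit-learn's `RandomForestClassifier` and synthetic data in place of our TF-DF model and passenger data so that it is self-contained, and it selects attributes instead of weighting them. The choice of three attributes is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the passenger table: 10 attributes, 3 of them informative
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Train on every attribute and record the out-of-bag accuracy
full_model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
full_model.fit(X, y)

# Rank attributes from most to least important and keep the top three
ranking = np.argsort(full_model.feature_importances_)[::-1]
top_k = ranking[:3]

# Retrain on the selected attributes only
reduced_model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
reduced_model.fit(X[:, top_k], y)

print(f"OOB accuracy, all attributes:   {full_model.oob_score_:.3f}")
print(f"OOB accuracy, top 3 attributes: {reduced_model.oob_score_:.3f}")
```

If the reduced model matches or beats the full model, the dropped attributes were contributing little, which would be a useful finding before trying any weighting scheme.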


#**References**
 - [1] Google Developers, “Classification: Precision and Recall | Machine Learning Crash Course | Google Developers,” Google Developers, Mar. 05, 2019. https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
 - [2] IBM, “What is Random Forest? | IBM,” www.ibm.com, 2023. https://www.ibm.com/topics/random-forest
 - [3] S. K. Dash, “Handling Missing Values with Random Forest,” Analytics Vidhya, May 04, 2022. https://www.analyticsvidhya.com/blog/2022/05/handling-missing-values-with-random-forest/
 - [4] S. Secherla, “Different Imputation Methods to Handle Missing Data,” Medium, Jun. 13, 2021. https://towardsdatascience.com/different-imputation-methods-to-handle-missing-data-8dd5bce97583
 - [5] “A Guide to Handling Missing values in Python,” kaggle.com. https://www.kaggle.com/code/parulpandey/a-guide-to-handling-missing-values-in-python
 - [6] “Ways to import CSV files in Google Colab,” GeeksforGeeks, Jul. 01, 2020. https://www.geeksforgeeks.org/ways-to-import-csv-files-in-google-colab/
 - [7] “Dealing With Missing Values in Python,” Analytics Vidhya, May 19, 2021. https://www.analyticsvidhya.com/blog/2021/05/dealing-with-missing-values-in-python-a-complete-guide/


###**Appendix**
There is some slight manual work the user has to do to run both programs. The file names are currently set for reading in, processing and predicting test data. To process the training data, change the file name opened for the `csvFile` variable in the first program to your training data file name, and change the accompanying output file names to the names you want for your processed training data.
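
One way to avoid editing file names by hand would be to wrap the row processing in functions that take the input and output as arguments. The sketch below is a minimal illustration using the `PassengerId` and `Cabin` formats handled by the first program. The function names, the sample rows and the `VIP` column are hypothetical, every row is assumed to have a `PassengerId`, and a missing cabin is written as `NaN` here, which differs from the blank placeholder the first program leaves.

```python
import csv
import io


def preprocess_row(row):
    """Return a copy of one CSV row with booleans as integers and composite fields split."""
    processed = {}
    for key, value in row.items():
        if value == "True":
            processed[key] = 1
        elif value == "False":
            processed[key] = 0
        elif value == "":
            processed[key] = "NaN"  # placeholder for missing data
        else:
            processed[key] = value

    # Cabin looks like deck/number/side; a missing cabin yields three placeholders
    cabin_parts = row["Cabin"].split("/") if row.get("Cabin") else ["NaN"] * 3
    processed["Deck"], processed["Cabin_num"], processed["Side"] = cabin_parts

    # PassengerId looks like group_position
    processed["Group"], processed["Position"] = row["PassengerId"].split("_")
    return processed


def preprocess_csv(in_file, out_file):
    """Read rows from in_file, write processed rows to out_file, return the row count."""
    rows = [preprocess_row(row) for row in csv.DictReader(in_file)]
    writer = csv.DictWriter(out_file, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


# In-memory demonstration with two made-up passengers
sample = "PassengerId,Cabin,VIP\n0001_01,B/0/P,False\n0002_01,,True\n"
buffer = io.StringIO()
row_count = preprocess_csv(io.StringIO(sample), buffer)
print(buffer.getvalue())
```

With real files, the same function could be called once for each data set, for example with `open("train.csv")` and `open("test.csv")` as inputs.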

In [None]:
#Use this code block to process the train and test data before we run it through the model
#Make sure to change the file names within the code if needed

import os
import numpy as np
import csv
import time
from google.colab import files

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn import preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression

import pandas as pd
import missingno as msno

#function to print out the data for any missing values we have within input
def getMissingData(df):
        # Total missing values
        mis_val = df.isnull().sum()

        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)

        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})

        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)

        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")

        # Return the dataframe with missing information
        return mis_val_table_ren_columns


#read in csv data using pandas, use our getMissingData() function to print out metrics of missing data
#------------------------------------------------------------------------------------------
#trainSet = pd.read_csv("printTrain (1).csv")
# trainMissing = getMissingData(trainSet)
# print("For Training Data")
# print(trainMissing)

# print("---------------------------------------------")

# testMissing = getMissingData(testSet)
# print("For Testing Data")
# print(testMissing)

#expressions to print out missing data values
# msno.matrix(sortedTrain)
# msno.heatmap(sortedTrain)
# msno.dendrogram(sortedTrain)
# msno.matrix(testSet)

#grab the input, preprocess the data by converting booleans->integers and breaking up composite attributes
print("upload test.csv")
files.upload() #upload test.csv (or train.csv when processing the training data)
#open our files to read the data and output the altered data - helps the model
csvFile = open('test.csv', 'r')
outFile = open('testFirstProc.csv', 'w')
reader = csv.DictReader(csvFile, delimiter=',')
#list of dictionaries which store each row of data from CSV input file
dictList = []

#create a dictionary for each row in our CSV and add it to our list of dictionaries
for row in reader:
    dictList.append(row)
#every row is now in memory, so the input file can be closed
csvFile.close()

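#add new keys needed to split the composite cabin and passengerid columns
for item in dictList:
    item["Deck"] = " "
    item["Cabin_num"] = " "
    item["Side"] = " "
    item["Group"] = " "
    item["Position"] = " "
    #test data has no Transported column, so add a placeholder for it
    #setdefault keeps the real labels intact when processing training data
    item.setdefault("Transported", " ")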

#iterate through each key/item for each dictionary in our list
for item in dictList:
    for key in item:
        #convert boolean values to integers - helps the random forest model
        if item[key]=="True":
            item[key]=1
        elif item[key]=="False":
            item[key]=0
        #place holder for missing data
        elif item[key]=="":
            item[key]="NaN"
        #when we find the cabin information, split by the / character to get each deck, cabin_num and side value
        #assign those three values to our new keys - helps random forest model
        elif key == "Cabin":
            cabinList = item["Cabin"].split('/')
            item["Deck"] = cabinList[0]
            item["Cabin_num"] = cabinList[1]
            item["Side"] = cabinList[2]
        #split the passenger id into group and position in the same fashion
        elif key == "PassengerId":
            IDList = item["PassengerId"].split('_')
            item["Group"] = IDList[0]
            item["Position"] = IDList[1]

        #remove redundant data/keys
    # del item["Cabin"]
    # del item["PassengerId"]
    # print(item)
    # print()

#iterate through each dictionary in our lists
#create an instance of a CSV dictionary writer
#write our keys as the header/first row, then write the values for each dictionary for all following lines
#produces the processed data into a CSV output file
# d = dictList[0]
# print(dictList[0])
#take the column names from the first row instead of relying on a leftover loop variable
writer = csv.DictWriter(outFile, dictList[0].keys())
writer.writeheader()
for item in dictList:
    writer.writerow(item)
    #print(item)

outFile.close()
#download the csv that we have preprocessed
#we still need to upload this file again so it is available for the interpolation step below
files.download("testFirstProc.csv")

print("If this is the first time running the program, you may need to select 'cancel upload'")
print("It will then download testFirstProc.csv, then you can run the program again and select it from your downloads")
print("upload the testFirstProc.csv")
files.upload() #upload testFirstProc.csv to interpolate the missing data
trainSet = pd.read_csv("testFirstProc.csv")

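#use interpolation to predict the missing values
#only numeric columns can be interpolated, so some categorical data stays missing,
#but this handles most of the missing integer/bool values
numericCols = trainSet.select_dtypes(include="number").columns
trainSet[numericCols] = trainSet[numericCols].interpolate(limit_direction="both")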
# #delete the extra index column
# firstCol = trainSet.columns[0]
# trainSet = trainSet.drop([firstCol], axis=1)
#write the interpolated trainSet to a csv file
trainSet.to_csv("testFull.csv", index=False)
#download the processed csv file - all the missing integer/boolean data has now been predicted through interpolation
files.download("testFull.csv")










In [None]:
#Use this code after you have processed the data in the code above
#This code will need the processed data CSV files to train and test the model properly

import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import time
from tensorflow import keras
from google.colab import files
from sklearn.model_selection import train_test_split
import numpy as np
import csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt


print("upload testFull.csv")
files.upload()
testSet = pd.read_csv("testFull.csv")

print("upload trainFull.csv")
files.upload()
dataset_df = pd.read_csv("trainFull.csv")

print("upload test.csv")
files.upload()


#split the training set into 80% to train and 20% to validate on
trainSet, validSet = train_test_split(dataset_df, test_size=0.2)

#convert the split pandas sets to tensorflow datasets so we can train the Random Forest model with them
trainSetTF = tfdf.keras.pd_dataframe_to_tf_dataset(trainSet, label="Transported")
validSetTF = tfdf.keras.pd_dataframe_to_tf_dataset(validSet, label = "Transported")

#convert the preTest csv with the added Transported column into a tf dataset for the model to evaluate/predict
test_data = tfdf.keras.pd_dataframe_to_tf_dataset(testSet, label="Transported")

#create instance of the RF decision tree model - supervised learning since every training row carries a Transported label
model = tfdf.keras.RandomForestModel()

#configure the model before training, set the metrics we want to measure
model.compile(metrics=["accuracy", keras.metrics.Precision(), keras.metrics.Recall()])
#train the model using the converted training data from the trainFull.csv
model.fit(trainSetTF)
predictions = model.predict(validSetTF)
i=1
for prediction in predictions:
  roundedVal = prediction.round()
  if roundedVal==1:
    predictedBool = True
  else:
    predictedBool=False
  #print(f'{i} raw value is {prediction}, rounded val is {roundedVal}, prediction is {predictedBool}')
  i+=1
#print out the declared metrics to evaluate accuracy/performance
model.evaluate(validSetTF)

#model is trained, validation set has been used to evaluate the model
#-------------------------------------------------------
#Begin to test model with testing set to predict the transported values
testPreds = model.predict(test_data)

#list to hold each true/false transported prediction for each user
predictionsList = []
count=1

#iterate through each prediction from the predict() function to print out true/false value for each passenger
for pred in testPreds:
  roundedValTest = pred.round()
  #convert the model's integer predictions into boolean values
  if roundedValTest==1:
    predictedBoolTest = True
  else:
    predictedBoolTest= False
  #print(f'raw value {count} is {pred}, rounded val is {roundedValTest}, prediction is {predictedBoolTest}')
  #add each predicted true/false transported value to a list to store all of the predictions
  predictionsList.append(predictedBoolTest)
  count+=1

model.summary()

testCSV = pd.read_csv("test.csv")

#use a list to store each dict pair of {passId, transportedPrediction}
resultsList = []
passIDCount=0
#iterate through each prediction in our list of transported predictions
#create a dict mapping for each PassengerId and the passenger's prediction
#add this dict to our list of results, repeat
for boolPred in predictionsList:
  results = {}
  results["PassengerId"] = testCSV.PassengerId[passIDCount]
  results["Transported"] = boolPred
  resultsList.append(results)
  passIDCount+=1

#write the final list of predictions to a submission csv file, recording each PassengerId and their predicted true/false value
#the with-block closes the file, so every row is flushed to disk before the download starts
with open("sample_submission_format.csv", 'w', newline='') as finalPredsFile:
  finalWriter = csv.DictWriter(finalPredsFile, fieldnames=["PassengerId", "Transported"])
  finalWriter.writeheader()
  for pair in resultsList:
    finalWriter.writerow(pair)

files.download("sample_submission_format.csv")






In [None]:
#run this block first to install the tfdf library, which the model program above imports
!pip3 install tensorflow_decision_forests