<a href="https://colab.research.google.com/github/uri-rizo2/Expanding-RENES/blob/main/Expanding_RENES_CRM_Application.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Expanding RENES: CRM Application**









> **Uriel Amezcua** ||
  **CSCI 198** || **Spring 23**


# **Import and load files**

- In this code section, I begin by installing the necessary libraries, including pandas, matplotlib, and seaborn. These libraries are essential for data manipulation, visualization, and analysis tasks.
- Next, I import the `files` module from the `google.colab` library. This module allows me to upload files to Google Colab.
- I then upload a CSV file containing my finished weather data. This file contains the information I need for my analysis.
- Using the pandas library, I read the uploaded CSV file and store the data in a DataFrame called `complete_data`. This DataFrame will serve as the main data structure for my analysis.
- To get a quick overview of the data, I display the content of the `complete_data` DataFrame. This allows me to visually inspect the data and check if it has been loaded correctly.
- Additionally, I call the `info()` method on the `complete_data` DataFrame. This provides a concise summary of the DataFrame, including the number of rows and columns, as well as the data types of each column. This information is useful for understanding the structure of the data and identifying any potential issues or missing values.

In [None]:
# Install necessary libraries
!pip install pandas matplotlib seaborn

# Load CSV file
from google.colab import files
uploaded = files.upload()
import pandas as pd
import io
complete_data = pd.read_csv(io.BytesIO(uploaded['Finished_Weather_Data.csv']))

# Display data and info
display(complete_data)
complete_data.info()

# **Clean and inspect Data**
- In this code section, I import the NumPy library and assign it the alias `np`. NumPy provides powerful numerical computing capabilities in Python.
- Next, I set a display option in pandas to format floating-point numbers. By setting `pd.options.display.float_format` to `'{:.2f}'.format`, I ensure that floating-point values will be displayed with two decimal places.
- Following that, I clean the 'Ghorb0' and 'Ghord0' columns of the `complete_data` DataFrame. Each column is first cast to strings with `astype(str)` and then stripped of leading and trailing white space with `str.strip()`. Casting to strings turns missing values into the literal text "nan", which the following cells convert back to real `NaN` values before the columns are made numeric.
- Finally, I display the updated `complete_data` DataFrame. This allows me to inspect the cleaned data and confirm that the modifications have been applied correctly. The displayed DataFrame shows floating-point numbers formatted to two decimal places and the stripped values in the 'Ghorb0' and 'Ghord0' columns.

In [None]:
import numpy as np
# Strip surrounding white space and display floats with two decimal places
pd.options.display.float_format = '{:.2f}'.format
complete_data['Ghorb0'] = complete_data['Ghorb0'].astype(str).str.strip()
complete_data['Ghord0'] = complete_data['Ghord0'].astype(str).str.strip()


display(complete_data)


In [None]:
complete_data["Ghorb0"] = complete_data["Ghorb0"].replace("nan", np.nan)  # Replace the string "nan" with a real NaN in the "Ghorb0" column
complete_data["Ghord0"] = complete_data["Ghord0"].replace("nan", np.nan)  # Replace the string "nan" with a real NaN in the "Ghord0" column

complete_data["Ghorb0"] = pd.to_numeric(complete_data["Ghorb0"])  # Convert "Ghorb0" column to numeric data type
complete_data['Ghord0'] = pd.to_numeric(complete_data["Ghord0"])  # Convert "Ghord0" column to numeric data type

complete_data = complete_data.round({'Ghorb0': 3, 'Ghord0': 3})  # Round values in the "Ghorb0" and "Ghord0" columns to 3 decimal places

display(complete_data)  # Display the updated complete_data DataFrame

In [None]:
complete_data = complete_data.dropna()
display(complete_data)
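
The cleaning steps above (strip white space, turn "nan" text into missing values, convert to numbers, round, drop incomplete rows) can also be sketched in one pass. This is a minimal sketch on a made-up two-column frame, not the project's data. The helper name `clean_numeric_columns` is hypothetical, `errors="coerce"` is a swapped-in shortcut that turns any unparseable text into `NaN` (not only the literal string "nan"), and rows are dropped based on the cleaned columns only.

```python
import pandas as pd

def clean_numeric_columns(df, columns, decimals=3):
    """Return a copy of df with the given columns stripped, made numeric, and rounded."""
    cleaned = df.copy()
    for col in columns:
        stripped = cleaned[col].astype(str).str.strip()
        # Unparseable text becomes NaN instead of raising an error
        cleaned[col] = pd.to_numeric(stripped, errors="coerce").round(decimals)
    # Drop rows that are missing a value in any of the cleaned columns
    return cleaned.dropna(subset=columns)

# Made-up data with padding, a "nan" string, and a missing value
toy = pd.DataFrame({
    "Ghorb0": [" 1.23456 ", "nan", "2.5"],
    "Ghord0": ["0.1", " 0.25", None],
})
toy_clean = clean_numeric_columns(toy, ["Ghorb0", "Ghord0"])
print(toy_clean)
```

Only the first toy row survives, because the other two each have one value that cannot be read as a number.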

# **Data Preprocessing and Subset Selection**
- In this code section, I create a new DataFrame called `Neural_set` by selecting specific columns from the `complete_data` DataFrame. The selected columns are 'Ghorb0', 'Ghord0', 'Sol Rad (Ly/day)', 'Air Temp (F)', 'Rel Hum (%)', and 'cloud coverage'. This new DataFrame serves as a subset of the original data, containing only the chosen columns.
- Next, I define a dictionary called `new_dtypes` to specify the new data types for each column in the `Neural_set` DataFrame. Each column is mapped to its corresponding data type. For example, 'Ghorb0' and 'Ghord0' are assigned the 'float16' data type, 'Rel Hum (%)' and 'cloud coverage' are assigned the 'uint8' data type, and so on.
- Then, I use the `astype()` method to convert the columns in the `Neural_set` DataFrame to the new data types specified in the `new_dtypes` dictionary. This ensures that the columns have the desired data types for further analysis or modeling tasks.
- Afterward, I display the updated `Neural_set` DataFrame. This allows me to verify that the data types of the columns have been successfully converted. The displayed DataFrame shows the selected columns with their updated data types.
- Finally, there is a commented line of code to save the updated dataset to a new CSV file called 'updated_dataset.csv'.
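
Compact data types save memory but have narrow ranges: `uint8` holds whole numbers from 0 to 255, and `float16` tops out at 65504 with roughly three significant decimal digits. The sketch below shows one way to check a column before casting it. The helper `fits_dtype` and the sample values are hypothetical and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def fits_dtype(series, dtype):
    """Return True if every value in series lies within the range of dtype."""
    dtype = np.dtype(dtype)
    # Integer and floating-point types expose their limits differently
    info = np.iinfo(dtype) if dtype.kind in "iu" else np.finfo(dtype)
    return bool(series.min() >= info.min and series.max() <= info.max)

humidity = pd.Series([12.0, 55.0, 100.0])       # made-up percentages
radiation = pd.Series([150.0, 480.0, 70000.0])  # made-up values, last one too large

print(fits_dtype(humidity, "uint8"))
print(fits_dtype(radiation, "float16"))
print(humidity.astype("uint8").tolist())
```

A check like this could run before `astype(new_dtypes)` so that an out-of-range column is noticed instead of being silently distorted.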



In [None]:
Neural_set = complete_data[['Ghorb0','Ghord0','Sol Rad (Ly/day)','Air Temp (F)','Rel Hum (%)','cloud coverage']].copy()

# Define the new data types for each column
new_dtypes = {
    'Ghorb0': 'float16',
    'Ghord0': 'float16',
    'Sol Rad (Ly/day)': 'float16',
    'Air Temp (F)': 'float16',
    'Rel Hum (%)': 'uint8',
    'cloud coverage': 'uint8'
}

# Convert the columns to the new data types
Neural_set = Neural_set.astype(new_dtypes)

# Save the updated dataset to a new CSV file
#Neural_set.to_csv('updated_dataset.csv', index=False)

display(Neural_set)


# **Configuration and library installation**
1. TensorFlow Installation: I use the `pip` package manager to install the TensorFlow library, which is a powerful machine learning framework widely used for building neural networks and performing various tasks related to deep learning.

2. I import specific modules from TensorFlow that I will be using in my code. These modules include `tensorflow`, `keras`, and `layers`. They provide essential functions and classes for constructing and training neural networks efficiently.

3. Seaborn Installation: As part of my data visualization requirements, I install the `seaborn` library using `pip`. Seaborn is a popular data visualization library that works in conjunction with Matplotlib and offers additional functionalities and aesthetically pleasing plot styles.

4. Importing Plotting Libraries: To enable visualizations and create informative plots, I import the `matplotlib.pyplot` module and the `seaborn` library.

5. Importing Data Processing Libraries: In order to handle data efficiently and perform various data manipulation tasks, I import the `numpy` and `pandas` libraries.

6. Configuring NumPy Printouts: To enhance the readability of NumPy arrays in my code, I configure the printout format using the `np.set_printoptions()` function. I set the precision of floating-point numbers to 3 decimal places and suppress the use of scientific notation. This ensures that the printed arrays are displayed in a clear and understandable manner.
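
To make the effect of the print options concrete, here is a small sketch with made-up values. It captures the text form of an array after applying the same `precision=3, suppress=True` settings.

```python
import numpy as np

values = np.array([0.000012345, 1234.56789, 3.14159265])  # made-up numbers

# Same settings as in the notebook: 3 decimals, no scientific notation
np.set_printoptions(precision=3, suppress=True)
formatted = str(values)
print(formatted)
```

The option only changes how arrays are printed; the stored values keep their full precision.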


In [None]:
# Install the TensorFlow library and import necessary modules.
!pip install tensorflow
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Print the version of TensorFlow installed.
#print(tf.__version__)

# Install the seaborn library for plotting.
!pip install -q seaborn

# Import necessary plotting libraries.
import matplotlib.pyplot as plt
import seaborn as sns

# Import necessary data processing libraries.
import numpy as np
import pandas as pd

# Configure NumPy printouts to be easier to read.
np.set_printoptions(precision=3, suppress=True)


# **Data Splitting, Visualization, and Feature Normalization**

In this section of the code, the dataset is prepared for training a machine learning model. The steps involved are as follows:

1. Splitting the Data:
   - The dataset, `Neural_set`, is split into training, validation, and test sets.
   - The `train_dataset` is created by randomly sampling 70% of the data using the `sample()` function, ensuring reproducibility with `random_state=0`.
   - The remaining 30% of the rows initially form the `test_dataset`.
   - Half of those rows are randomly sampled into the validation set using `sample()` with `frac=0.5` and `random_state=0`.
   - The rest is kept as the final `test_dataset`, giving an approximate 70/15/15 split.

2. Visualizing the Dataset:
   - The `train_dataset` is visualized using the `sns.pairplot()` function from the seaborn library.
   - The variables 'Ghorb0', 'Ghord0', 'Sol Rad (Ly/day)', 'Air Temp (F)', 'Rel Hum (%)', and 'cloud coverage' are selected for visualization.
   - The diagonal elements of the plot are shown as kernel density estimation (kde) plots.

3. Extracting Labels:
   - The labels for the training, validation, and test sets are extracted from their respective datasets.
   - The label 'Sol Rad (Ly/day)' is popped from the datasets and stored in `train_labels`, `val_labels`, and `test_labels` respectively.

4. Feature Normalization:
   - The features of the training and test sets are copied to separate variables, `train_features` and `test_features`.
   - A `Normalization` layer is created using `tf.keras.layers.Normalization` and is adapted to the training data.
   - The mean values of the normalizer are printed using `normalizer.mean.numpy()` to confirm the per-feature means the layer learned from the training data.
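
The splitting and label-extraction steps can be checked on a toy frame. This is a minimal sketch with made-up columns (`feature`, `target`) standing in for `Neural_set`; it follows the same `sample`/`drop`/`pop` pattern and confirms that the three subsets do not overlap.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for Neural_set
toy = pd.DataFrame({"feature": np.arange(100), "target": np.arange(100) * 2.0})

# 70% for training, then split the remainder in half
train = toy.sample(frac=0.7, random_state=0)
holdout = toy.drop(train.index)
val = holdout.sample(frac=0.5, random_state=0)
test = holdout.drop(val.index)

print(len(train), len(val), len(test))

# The three index sets should be disjoint and together cover every row
all_idx = train.index.union(val.index).union(test.index)
print(len(all_idx) == len(toy))

# pop() removes the label column from the frame and returns it
train_labels = train.pop("target")
print(list(train.columns), train_labels.name)
```

Because `pop` modifies the frame in place, running the label-extraction cell twice on the real data would raise a `KeyError` the second time.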


In [None]:
# Split the data into training, validation, and test sets
train_dataset = Neural_set.sample(frac=0.7, random_state=0)
test_dataset = Neural_set.drop(train_dataset.index)
val_dataset = test_dataset.sample(frac=0.5, random_state=0)
test_dataset = test_dataset.drop(val_dataset.index)


In [None]:
# Visualize the dataset
sns.pairplot(train_dataset[['Ghorb0','Ghord0','Sol Rad (Ly/day)','Air Temp (F)','Rel Hum (%)','cloud coverage']], diag_kind='kde')



In [None]:
# Extract the labels from the datasets
train_labels = train_dataset.pop('Sol Rad (Ly/day)')
val_labels = val_dataset.pop('Sol Rad (Ly/day)')
test_labels = test_dataset.pop('Sol Rad (Ly/day)')


In [None]:
#Normalize the features
train_features = train_dataset.copy()
test_features = test_dataset.copy()
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.array(train_features))

print(normalizer.mean.numpy())
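
For intuition, the `Normalization` layer standardizes each feature column using the mean and variance it learned during `adapt`. Below is a NumPy-only sketch of that arithmetic on made-up numbers. It illustrates the idea and is not the Keras implementation.

```python
import numpy as np

# Made-up feature matrix: rows are samples, columns are features
train = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 60.0]])
new = np.array([[2.0, 30.0]])

# Statistics come from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation

train_scaled = (train - mean) / std
new_scaled = (new - mean) / std  # reuse the training statistics on unseen rows

print(mean)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(new_scaled)
```

Reusing the training statistics for validation and test rows keeps information about those rows out of the preprocessing step.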

# **Model Training and Evaluation with Error Visualization**

1. Model Definition and Compilation:
   - The `build_and_compile_model` function is defined to create a sequential model using the Keras API.
   - The model architecture consists of the normalization layer, two 64-unit dense layers with ReLU activation, and a single-unit dense output layer with no activation.
   - The compiled model uses the Huber loss function and the Adam optimizer with a learning rate of 0.001.
   - The defined model is stored in the `model` variable.

2. Model Training:
   - The `model.fit` function is used to train the model.
   - The training features (`train_features`) and labels (`train_labels`) are passed as inputs.
   - The validation split is set to 0.2, so Keras holds out the last 20% of the training data for validation. The separate `val_dataset` created earlier is not passed to `fit`.
   - The `verbose` parameter is set to 0, suppressing the training progress output.
   - The training is performed for 100 epochs.

3. Model Evaluation:
   - The trained model is evaluated on the test set using the `model.evaluate` function.
   - The test features (`test_features`) and labels (`test_labels`) are passed as inputs.
   - The evaluation results are stored in the `test_results` dictionary.

4. Error Visualization:
   - The prediction errors are calculated by subtracting the actual labels (`test_labels`) from the predicted values (`test_predictions`).
   - The resulting errors are stored in the `error` variable.
   - The `plt.hist` function is used to create a histogram of the prediction errors.
   - The `error` values are passed as input to the function.
   - The `bins` parameter is set to 25, indicating the number of bins in the histogram.
   - The x-axis label is set to 'Prediction Error [total irradiance]' using `plt.xlabel`.
   - The y-axis label is set to 'Count' using `plt.ylabel`.

In [None]:
# Define the model
def build_and_compile_model(norm):
    model = keras.Sequential([
        norm,
        layers.Dense(64, activation='relu'),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])
    model.compile(loss='huber_loss',
                  optimizer=tf.keras.optimizers.Adam(0.001))
    return model

In [None]:
# Create the model
model = build_and_compile_model(normalizer)

In [None]:
# Train the model
history = model.fit(
    train_features,
    train_labels,
    validation_split=0.2,
    verbose=0,
    epochs=100)

In [None]:
# Evaluate the model on the test set
test_results = {}
test_results['model'] = model.evaluate(test_features, test_labels, verbose=0)

result_table = pd.DataFrame(test_results, index=['Huber Loss [Total Radiation]']).T
print(result_table)



In [None]:
# Predict on the test set, then plot the distribution of prediction errors
test_predictions = model.predict(test_features).flatten()
error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel('Prediction Error [total irradiance]')
_ = plt.ylabel('Count')

# **Causal Inference with Gaussian Process Regression**
1. Data Sampling:
   - A random sampling approach is used to select 10% of the test features (`test_features`) as a representative subset.
   - The `random.sample` function is utilized to randomly sample data points without replacement.
   - The sampled data is stored in the `sampled_data` variable.

2. Data Splitting:
   - The sampled data is split into training and testing sets using the `train_test_split` function from scikit-learn.
   - The `test_size` parameter is set to 0.2, indicating that 20% of the sampled data will be used for testing.
   - The training data is stored in the `train_data` variable, and the testing data is stored in the `test_data` variable.

3. Memory Usage:
   - The size of the `sampled_data` list is measured using the `sys.getsizeof` function. This is a shallow size: it covers the list object itself, not the records it holds.
   - The resulting size in bytes is printed to the console.

4. Kernel Definition:
   - The covariance function, or kernel, is defined for the Gaussian Process model.
   - The Radial Basis Function (RBF) kernel is used in this case.

5. Gaussian Process Model Creation:
   - The Gaussian Process Regressor from scikit-learn is instantiated with the defined kernel.
   - `GaussianProcessRegressor` is a class; the resulting object is stored in the `model` variable, which replaces the neural network stored there earlier.

6. Model Fitting:
   - The Gaussian Process model is fitted to the training data using the `fit` method.
   - The full training features (`train_features`) and labels (`train_labels`) are provided as inputs to the model; the `train_data` subset sampled above is not used in this step.

7. Prediction:
   - The trained model is used to make predictions on the test features (`test_features`).
   - The `predict` method is called on the model, and the predicted values are stored in the `predictions` variable.
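
The sampling step above draws feature rows only. If a subset is meant for fitting, each sampled row also needs its label. Here is a minimal sketch, assuming features and labels share an index (as they do after `pop`). The frames and names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for a feature frame and its label series
features = pd.DataFrame({"a": np.arange(50.0), "b": np.arange(50.0) * 2})
labels = pd.Series(np.arange(50.0) * 3, name="target")

# Draw 10% of the rows, then pick the labels with the same index
sampled_features = features.sample(frac=0.1, random_state=0)
sampled_labels = labels.loc[sampled_features.index]

print(sampled_features.shape, sampled_labels.shape)

# Each label still belongs to its own row (target is 3 * a by construction)
print((sampled_labels == sampled_features["a"] * 3).all())
```

Sampling through the DataFrame keeps column names and the index, which makes it easier to pass the subset to `fit` later.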


In [None]:
import random
import sys
from sklearn.model_selection import train_test_split
from sklearn.gaussian_process.kernels import RBF
from sklearn.gaussian_process import GaussianProcessRegressor

# randomly sample 10% of data
sampled_data = random.sample(list(test_features.to_records(index=False)), int(len(test_features)*0.1))


# split into training and testing sets
train_data, test_data = train_test_split(sampled_data, test_size=0.2)




sampled_data_size = sys.getsizeof(sampled_data)
print("Sampled data size:", sampled_data_size, "bytes")


Sampled data size: 4344 bytes
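
Note that `sys.getsizeof` on a list reports the list object alone. The sketch below, on made-up rows, compares that shallow figure with a rough estimate that also counts the rows and their values. The estimate is approximate, since objects shared between rows would be counted more than once.

```python
import sys

records = [(float(i), float(i) * 2.0) for i in range(500)]  # made-up rows

shallow = sys.getsizeof(records)  # the list object only
# Add the size of each tuple and of each value inside it
deep = shallow + sum(
    sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row) for row in records
)

print("shallow:", shallow, "bytes")
print("with contents:", deep, "bytes")
```

For a pandas DataFrame, `memory_usage(deep=True).sum()` gives a comparable total without converting the data to a list.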


In [None]:
# define kernel (covariance function)
kernel = RBF()

# create Gaussian process model
model = GaussianProcessRegressor(kernel=kernel)

# fit model to data
model.fit(train_features, train_labels)

# make predictions on test data
predictions = model.predict(test_features)
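
To show what the Gaussian Process step computes, here is a NumPy-only sketch of the posterior mean and standard deviation under an RBF kernel. It is a simplified illustration on made-up one-dimensional data: the length scale is fixed at 1.0 and a small `noise` term is added for numerical stability, whereas scikit-learn also tunes the kernel hyperparameters during `fit`. The function names are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, noise=1e-8):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_st = rbf_kernel(x_test, x_train)
    mean = k_st @ np.linalg.solve(k_tt, y_train)
    cov = rbf_kernel(x_test, x_test) - k_st @ np.linalg.solve(k_tt, k_st.T)
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Made-up training points on a sine curve
x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.sin(x_train).ravel()
x_test = np.array([[1.0], [1.5], [10.0]])

mean, std = gp_posterior(x_train, y_train, x_test)
print(mean)
print(std)
```

At a training point the uncertainty is close to zero, and far from the data the prediction falls back to the zero prior mean with a standard deviation near one. The linear solve on the n-by-n kernel matrix scales cubically with the number of training rows, which is one reason to fit on a sampled subset. In scikit-learn, `model.predict(test_features, return_std=True)` returns a comparable pair of arrays.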