# Getting the most solar power for your dollar
## Preprocessing and feature engineering
### Zachary Brown

The data has been cleaned and preliminary analysis has identified some trends we should expect to see the eventual model pick up on. Now I'm going to preprocess the data so that any models I work with can use the data appropriately. This will include imputing missing data, feature engineering, scaling, and splitting the data into testing and training datasets.

I'll start by loading the necessary packages and reading in the data from the exploratory data analysis portion of the project.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme('notebook')
import scipy.stats
from sklearn.model_selection import train_test_split

In [None]:
print(os.getcwd())
# Build the path with os.path.join so it works on any operating system
os.chdir(os.path.join('..', 'data', 'processed'))
print(os.getcwd())

In [None]:
data = pd.read_csv('processed_data.csv', index_col=0, na_values = [-1, '-1'], low_memory=False)
data.shape

In [None]:
data.columns.groupby(data.dtypes)

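I'm going to trim zip code from 9 digits to 5 and then switch the data type from object to numeric. Missing zip codes stay as NaN for now, so the column is stored as a float until they are imputed.

In [None]:
# .str[:5] keeps the first five characters of each value; [0:5] would select the first five rows instead
data['zip_code'] = pd.to_numeric(data['zip_code'].astype(str).str[:5], errors='coerce')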
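
Taking the first five characters is a string operation on each value (`.str[:5]`), whereas `[0:5]` on a Series selects the first five rows. The sketch below shows the per-value version on a handful of made-up zip codes (the values are illustrative, not from the dataset). One side effect worth knowing: a numeric zip code loses any leading zero.

```python
import pandas as pd

# Made-up zip codes: 9-digit, 5-digit, missing, and one with a leading zero
zips = pd.Series(['94110-1234', '85001', None, '02139-4307'])

# Keep the first five characters of each entry, then convert to numbers;
# anything that cannot be parsed (including missing values) becomes NaN
five_digit = pd.to_numeric(zips.str[:5], errors='coerce')
print(five_digit)
```

If leading zeros matter downstream, keeping the trimmed zip code as a string is a reasonable alternative.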

I want to check each column for the percentage of values that are missing and remove any features with more than 30% missing values.

In [None]:
percent_missing = data.isnull().sum()/len(data)*100
percent_missing.sort_values(ascending=False)

The only feature that needs to be removed due to null values is 'date_of_battery_install'. 

In [None]:
data=data.drop(columns=['date_of_battery_install'])
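
The 30% rule can also be applied programmatically rather than by reading the sorted output. This is a minimal sketch on a made-up frame; `drop_sparse_columns` and the toy column names are hypothetical and not part of the project.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_pct=30.0):
    """Return df without columns whose missing share exceeds the threshold,
    plus the list of column names that were dropped."""
    percent_missing = df.isnull().mean() * 100
    to_drop = percent_missing[percent_missing > max_missing_pct].index
    return df.drop(columns=to_drop), list(to_drop)

toy = pd.DataFrame({
    'system_size': [5.0, 6.2, np.nan, 4.1],                  # 25% missing, kept
    'battery_date': [np.nan, np.nan, np.nan, '2020-01-01'],  # 75% missing, dropped
})
trimmed, dropped = drop_sparse_columns(toy)
print(dropped)
```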

Next I want to browse the object columns and count how many unique values each has. A feature with only one unique value, or with nearly as many unique values as there are rows, won't help identify any trends.

In [None]:
for col in data.columns:
    if data[col].dtypes == 'object':
        print(col, ' : ', data[col].nunique())

Based on these results it should be safe to remove system_id_1, as that has a unique value for almost every entry. I'll also drop customer_segment since earlier in the project I limited the dataset to only residential installations. 

In [None]:
data = data.drop(columns=['system_id_1', 'customer_segment'])
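
The same screening can be expressed as a small helper. This is only a sketch: `flag_uninformative`, the 0.9 cutoff for "nearly unique", and the toy data are assumptions made for illustration.

```python
import pandas as pd

def flag_uninformative(df, unique_ratio=0.9):
    """Flag non-numeric columns that are constant or nearly unique per row."""
    flagged = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
        n_unique = df[col].nunique()
        if n_unique <= 1:
            flagged[col] = 'constant'
        elif n_unique / len(df) >= unique_ratio:
            flagged[col] = 'near-unique'
    return flagged

toy = pd.DataFrame({
    'system_id': ['a1', 'a2', 'a3', 'a4'],         # one value per row
    'segment': ['RES', 'RES', 'RES', 'RES'],       # a single value
    'city': ['Fresno', 'Fresno', 'Davis', 'Davis'],
})
flags = flag_uninformative(toy)
print(flags)
```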

Great! Now I need to encode these categorical features as I did with the states earlier. Any feature with fewer than 30 unique values I'll just dummy encode like I did with states. For any feature with more than 30 unique values I'll check what share of the entries each unique value accounts for. If a value appears in more than 10% of the entries then I'll check whether it correlates with price per kW when compared against all other values for that feature.

In [None]:
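# dtype=int keeps the dummies as 0/1 integers, matching the dummy columns created by hand below
data = pd.get_dummies(data, columns=['technology_module_1', 'data_provider_1'], dtype=int)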
data.shape

In [None]:
cols = ['installation_date', 'zip_code', 'city', 'utility_service_territory', 'installer_name', 'module_manufacturer_1',\
        'module_model_1', 'inverter_manufacturer_1', 'inverter_model_1']
for col in cols:
    # dropna=False counts missing entries as their own value, as described below
    print(col, ':\n', data[col].value_counts(normalize=True, dropna=False).loc[lambda x : x>0.1], '\n')

Based on these distributions I'm going to drop installation_date and installer_name since none of their values account for 10% or more of the entries in the dataset. For the values that do account for at least 10% of the data I'll perform t-tests comparing the price_per_kw for entries with that value vs the rest of the entries. If the p-value of the t-test is less than 0.01 then I'll create a dummy column for it. This includes treating missing data as its own value, since there could be a correlation there as well.

In [None]:
data = data.drop(columns=['installation_date', 'installer_name'])
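
The test-then-encode rule described above could be wrapped in one helper instead of being repeated per value. The sketch below is a hypothetical illustration on synthetic data: `add_dummy_if_significant` and the generated dummy column names are my own assumptions, not the names used in the cells that follow. It uses the default `scipy.stats.ttest_ind`, which assumes equal variances; passing `equal_var=False` would give Welch's t-test instead.

```python
import numpy as np
import pandas as pd
import scipy.stats

def add_dummy_if_significant(df, col, value, target='price_per_kw', alpha=0.01):
    """Run a two-sample t-test of `value` vs. all other rows and add a 0/1
    dummy column when the p-value is below alpha.
    Pass value=None to test missing entries as their own category."""
    mask = df[col].isna() if value is None else (df[col] == value)
    result = scipy.stats.ttest_ind(df.loc[mask, target], df.loc[~mask, target])
    if result.pvalue < alpha:
        label = 'missing' if value is None else str(value).lower().replace(' ', '_')
        df[f'{col}_{label}'] = mask.astype(int)
    return result.pvalue

# Synthetic data: utility A is clearly cheaper than utility B
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'utility': ['A'] * 50 + ['B'] * 50,
    'price_per_kw': np.concatenate([rng.normal(3.0, 0.2, 50),
                                    rng.normal(4.0, 0.2, 50)]),
})
p_value = add_dummy_if_significant(toy, 'utility', 'A')
print(p_value, 'utility_a' in toy.columns)
```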

In [None]:
# Utility service territory: Pacific Gas and Electric
pge = data[data['utility_service_territory'] == 'Pacific Gas and Electric']
not_pge = data[data['utility_service_territory'] != 'Pacific Gas and Electric']
print(scipy.stats.ttest_ind(pge['price_per_kw'], not_pge['price_per_kw']))

In [None]:
data['territory_pacific_gas_and_electric'] = (data['utility_service_territory'] == 'Pacific Gas and Electric')*1

In [None]:
# Utility service territory: Southern California Edison
sce = data[data['utility_service_territory'] == 'Southern California Edison']
not_sce = data[data['utility_service_territory'] != 'Southern California Edison']
print(scipy.stats.ttest_ind(sce['price_per_kw'], not_sce['price_per_kw']))

In [None]:
# Utility service territory: San Diego Gas and Electric
sdge = data[data['utility_service_territory'] == 'San Diego Gas and Electric']
not_sdge = data[data['utility_service_territory'] != 'San Diego Gas and Electric']
print(scipy.stats.ttest_ind(sdge['price_per_kw'], not_sdge['price_per_kw']))

At this point I'm going to make a copy of the dataframe as a bookmark preceding any data imputation or data loss. My next step will impute missing data as an 'other' category, and after initial modeling I may want to jump back to before this step to rework how I handle those missing data.

In [None]:
no_imputation = data.copy()
no_imputation.to_csv('pre-imputation preprocessing data.csv')

In [None]:
data['territory_san_diego_gas_and_electric'] = (data['utility_service_territory'] == 'San Diego Gas and Electric')*1
data['utility_service_territory_other'] = (~data['utility_service_territory'].isin(['Pacific Gas and Electric',\
                                                                                     'San Diego Gas and Electric']))*1
data=data.drop(columns=['utility_service_territory'])

In [None]:
# Module manufacturer 1: Hanwha Q CELLS
hqc = data[data['module_manufacturer_1'] == 'Hanwha Q CELLS']
not_hqc = data[data['module_manufacturer_1'] != 'Hanwha Q CELLS']
print(scipy.stats.ttest_ind(hqc['price_per_kw'], not_hqc['price_per_kw']))

In [None]:
data['hanwha_q_cells'] = (data['module_manufacturer_1'] == 'Hanwha Q CELLS')*1

In [None]:
# Module manufacturer 1: SunPower
sp = data[data['module_manufacturer_1'] == 'SunPower']
not_sp = data[data['module_manufacturer_1'] != 'SunPower']
print(scipy.stats.ttest_ind(sp['price_per_kw'], not_sp['price_per_kw']))

In [None]:
data['sunpower'] = (data['module_manufacturer_1'] == 'SunPower')*1

In [None]:
# Module manufacturer 1: Missing
missing = data[data['module_manufacturer_1'].isna()]
not_missing = data[~data['module_manufacturer_1'].isna()]
print(scipy.stats.ttest_ind(missing['price_per_kw'], not_missing['price_per_kw']))

In [None]:
data['module_manufacturer_1_missing'] = (data['module_manufacturer_1'].isna())*1

In [None]:
# Module manufacturer 1: LG Electronics
lg = data[data['module_manufacturer_1'] == 'LG Electronics']
not_lg = data[data['module_manufacturer_1'] != 'LG Electronics']
print(scipy.stats.ttest_ind(lg['price_per_kw'], not_lg['price_per_kw']))

In [None]:
data['lg_electronics'] = (data['module_manufacturer_1'] == 'LG Electronics')*1
data['module_manufacturer_1_other'] = (~data['module_manufacturer_1'].isin(['Hanwha Q CELLS',\
                                                                                     'SunPower', 'LG Electronics', np.nan]))*1
data=data.drop(columns=['module_manufacturer_1'])

In [None]:
# Module model 1: missing values
missing = data[data['module_model_1'].isna()]
not_missing = data[~data['module_model_1'].isna()]
print(scipy.stats.ttest_ind(missing['price_per_kw'], not_missing['price_per_kw']))

In [None]:
data['module_model_1_missing'] = (data['module_model_1'].isna())*1
data['module_model_1_not_missing'] = (~data['module_model_1'].isna())*1
data = data.drop(columns=['module_model_1'])

In [None]:
# Inverter manufacturer 1: SolarEdge Technologies
se = data[data['inverter_manufacturer_1'] == 'SolarEdge Technologies']
not_se = data[data['inverter_manufacturer_1'] != 'SolarEdge Technologies']
print(scipy.stats.ttest_ind(se['price_per_kw'], not_se['price_per_kw']))

In [None]:
data['solaredge_technologies'] = (data['inverter_manufacturer_1'] == 'SolarEdge Technologies')*1

In [None]:
# Inverter manufacturer 1: Enphase Energy
ee = data[data['inverter_manufacturer_1'] == 'Enphase Energy']
not_ee = data[data['inverter_manufacturer_1'] != 'Enphase Energy']
print(scipy.stats.ttest_ind(ee['price_per_kw'], not_ee['price_per_kw']))

In [None]:
data['enphase_energy'] = (data['inverter_manufacturer_1'] == 'Enphase Energy')*1

In [None]:
# Inverter manufacturer 1: Missing
missing = data[data['inverter_manufacturer_1'].isna()]
not_missing = data[~data['inverter_manufacturer_1'].isna()]
print(scipy.stats.ttest_ind(missing['price_per_kw'], not_missing['price_per_kw']))

In [None]:
data['inverter_manufacturer_1_missing'] = (data['inverter_manufacturer_1'].isna())*1

In [None]:
# Inverter manufacturer 1: SunPower
sp = data[data['inverter_manufacturer_1'] == 'SunPower']
not_sp = data[data['inverter_manufacturer_1'] != 'SunPower']
print(scipy.stats.ttest_ind(sp['price_per_kw'], not_sp['price_per_kw']))

In [None]:
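# Use a distinct name so this doesn't overwrite the module manufacturer 'sunpower' dummy created above
data['inverter_sunpower'] = (data['inverter_manufacturer_1'] == 'SunPower')*1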
data['inverter_manufacturer_1_other'] = (~data['inverter_manufacturer_1'].isin(['SolarEdge Technologies',\
                                                                                     'Enphase Energy', 'SunPower', np.nan]))*1
data=data.drop(columns=['inverter_manufacturer_1'])

In [None]:
# Inverter model 1: Missing
missing = data[data['inverter_model_1'].isna()]
not_missing = data[~data['inverter_model_1'].isna()]
print(scipy.stats.ttest_ind(missing['price_per_kw'], not_missing['price_per_kw']))

In [None]:
data['inverter_model_1_missing'] = (data['inverter_model_1'].isna())*1

In [None]:
# Inverter model 1: IQ7-60-2-US [240V]
iq7 = data[data['inverter_model_1'] == 'IQ7-60-2-US [240V]']
not_iq7 = data[data['inverter_model_1'] != 'IQ7-60-2-US [240V]']
print(scipy.stats.ttest_ind(iq7['price_per_kw'], not_iq7['price_per_kw']))

In [None]:
data['iq7'] = (data['inverter_model_1'] == 'IQ7-60-2-US [240V]')*1

In [None]:
# Inverter model 1: SE3800H-US [240V]
se3 = data[data['inverter_model_1'] == 'SE3800H-US [240V]']
not_se3 = data[data['inverter_model_1'] != 'SE3800H-US [240V]']
print(scipy.stats.ttest_ind(se3['price_per_kw'], not_se3['price_per_kw']))

In [None]:
data['se3'] = (data['inverter_model_1'] == 'SE3800H-US [240V]')*1
data['inverter_model_1_other'] = (~data['inverter_model_1'].isin(['IQ7-60-2-US [240V]',\
                                                                                     'SE3800H-US [240V]', np.nan]))*1
data=data.drop(columns=['inverter_model_1'])

In [None]:
print(data.columns.groupby(data.dtypes))

print(data.shape)

Great! All of the non-numeric features have been converted into dummy features or dropped. 

It's important to note that for utility service territory I've imputed the missing values as 'other'. This may need to be adjusted later on as I work through modeling.

The rest of the data imputation and the scaling need to be fit on the training dataset and then applied to the test dataset. Now that all of the desired features have been created, I'll split the data into test and train sets.

In [None]:
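# random_state is an arbitrary seed that makes the split reproducible between runs
X_train, X_test, y_train, y_test = train_test_split(data.drop(columns='price_per_kw'), data['price_per_kw'], test_size=0.25, random_state=42)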

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Missing values were coded as -1 in the raw data and read in as NaN, so now I need to check which columns in the training set still contain missing values, then decide how best to replace them for the modeling.

In [None]:
for col in X_train.columns:
    if X_train[col].isna().sum() != 0:
        print(col, ' : ', (X_train[col].isna().sum()))

First I'll work through the binary columns (numeric features that only take the values 0 and 1) to check the distributions and determine how best to impute the missing values.

In [None]:
cols = ['self_installed', 'tracking', 'ground_mounted', 'third_party_owned', 'bipv_module_1', 'bifacial_module_1',\
      'additional_inverters', 'additional_modules', 'dc_optimizer', 'micro_inverter_1', 'built_in_meter_inverter_1',\
      'solar_storage_hybrid_inverter_1']
# Extra keyword arguments to apply() are passed through to value_counts
X_train[cols].apply(pd.Series.value_counts, normalize=True)

Many of these columns are heavily skewed, so I'll impute the missing values with the mode, which assigns them to the more heavily favored response. When I begin modeling I'll compare the model using the imputed results vs. removing the columns completely to determine whether I need to reassess how to impute these missing values.

In [None]:
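# Take each column's mode from the training set and reuse it for the test set
modes = X_train[cols].mode().iloc[0]
X_train[cols] = X_train[cols].fillna(modes)
X_test[cols] = X_test[cols].fillna(modes)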
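
To make the mode-based rule concrete, here is a small sketch in which each binary column's fill value is taken from the training split and reused for the test split. The two toy frames are made up for illustration; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({'tracking': [0, 0, 0, np.nan],
                      'third_party_owned': [1, 1, np.nan, 0]})
test = pd.DataFrame({'tracking': [np.nan, 1],
                     'third_party_owned': [np.nan, 0]})

# mode() ignores NaN; the first row holds the most common value per column
modes = train.mode().iloc[0]

train_filled = train.fillna(modes)
test_filled = test.fillna(modes)   # test rows get the training modes
print(test_filled)
```

Note that the mode is not 0 for every column here: `third_party_owned` is filled with 1 because that is its most common training value.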

Now the more complicated categorical features: zip code and city. For zip code I think it makes sense to impute missing values with the most common zip code in the state where that sample lies. A missing city should probably be filled with the most common city within the sample's zip code. I'll start with imputing zip code first, then filter down to city.

In [None]:
import uszipcode
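
One way the group-wise imputation described above might look is sketched below using plain pandas rather than `uszipcode`. Everything here is an assumption for illustration: the helper `fill_with_group_mode` is hypothetical, the toy rows are made up, and the single `state` column stands in for the state dummy columns that this project actually carries.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(df, target, group):
    """Fill missing `target` values with the most common `target` in each `group`."""
    def first_mode(s):
        m = s.mode()                      # mode() skips missing values
        return m.iloc[0] if not m.empty else np.nan
    lookup = df.groupby(group)[target].agg(first_mode)
    return df[target].fillna(df[group].map(lookup))

toy = pd.DataFrame({
    'state': ['CA', 'CA', 'CA', 'AZ', 'AZ'],
    'zip_code': [94110, 94110, np.nan, 85001, np.nan],
    'city': ['San Francisco', None, 'San Francisco', 'Phoenix', None],
})

# Zip code first (by state), then city (by the now-complete zip code)
toy['zip_code'] = fill_with_group_mode(toy, 'zip_code', 'state')
toy['city'] = fill_with_group_mode(toy, 'city', 'zip_code')
print(toy)
```

In the real workflow the lookup would be built from the training split only and then mapped onto the test split, in line with the other imputation steps.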

Now that the binary features have been taken care of, I'll recheck which continuous numerical variables are left to clean up.

In [None]:
cols=[]
for col in X_train.columns:
    if X_train[col].isna().sum() != 0:
        cols.append(col)
        # Convert to a percentage before rounding to avoid floating point artifacts
        print(col, ' : ', round(X_train[col].isna().mean() * 100, 1), '% missing')

In [None]:
for col in cols:
    # Base the bin count on the training data only
    bins=int(round(np.sqrt(X_train[col].nunique())))
    sns.histplot(data=X_train[col], bins=bins)
    plt.title(col)
    plt.ylabel('Installations')
    plt.xlabel(col)
    plt.show()

In [None]:
X_train['azimuth_1'].value_counts().head()

Right now my goal is to get a quick first look at some models so that I can determine which features are important and then refine the imputation of any missing data for those features if needed. To that end, I'm going to impute these missing values as the mode for azimuth_1 since it's an angle and there is an obvious preference for one specific angle, and the median for the rest.

In [None]:
from sklearn.impute import SimpleImputer

# Fit each imputer on the training data only, then reuse the fitted values on the test data
mode_imputer = SimpleImputer(strategy='most_frequent')

X_train['azimuth_1'] = mode_imputer.fit_transform(X_train[['azimuth_1']]).ravel()
X_test['azimuth_1'] = mode_imputer.transform(X_test[['azimuth_1']]).ravel()

cols.remove('azimuth_1')

median_imputer = SimpleImputer(strategy='median')

for col in cols:
    X_train[col] = median_imputer.fit_transform(X_train[[col]]).ravel()
    X_test[col] = median_imputer.transform(X_test[[col]]).ravel()
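
The principle stated before the split, that imputation and scaling are fit on the training set and then applied to the test set, can be sketched with a scikit-learn `Pipeline`. This is an illustration on made-up numbers, not the notebook's actual preprocessing; the choice of `StandardScaler` is an assumption, since no scaler is chosen in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'system_size': [4.0, 6.0, np.nan, 8.0],
                      'tilt_1': [20.0, np.nan, 30.0, 40.0]})
test = pd.DataFrame({'system_size': [np.nan, 10.0],
                     'tilt_1': [25.0, np.nan]})

prep = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

train_ready = prep.fit_transform(train)  # medians, means and variances learned here
test_ready = prep.transform(test)        # the same values reused, never refit

# Medians learned from the training split: 6.0 and 30.0
print(prep.named_steps['impute'].statistics_)
```

Missing test values are replaced by the training medians, which also equal the training means in this toy example, so they come out as 0 after scaling.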

Now I'll do one last check of the entire dataframe to confirm that there are no more null values or -1s present, then we should be ready to start modeling.

In [None]:
# Total null values in dataframe
print(X_train.isna().sum().sum())
print(X_test.isna().sum().sum())

In [None]:
# Total -1 values in dataframe
print((X_train.values == -1).sum())
print((X_test.values == -1).sum())
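
These two checks can be folded into a single helper and turned into assertions so the notebook stops if anything slips through. A minimal sketch, where `count_bad_values` and the toy frame are hypothetical:

```python
import numpy as np
import pandas as pd

def count_bad_values(df, sentinel=-1):
    """Return (number of nulls, number of numeric cells equal to the sentinel)."""
    nulls = int(df.isna().sum().sum())
    sentinels = int((df.select_dtypes(include='number') == sentinel).sum().sum())
    return nulls, sentinels

toy = pd.DataFrame({'tilt_1': [20.0, -1.0, 30.0],
                    'azimuth_1': [180.0, np.nan, 180.0]})
before = count_bad_values(toy)

# After replacing the sentinel and filling the gaps, both counts drop to zero
clean = toy.replace(-1, np.nan).fillna(0)
after = count_bad_values(clean)
print(before, after)
```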

Great! Our data looks good to go. I'll export all four portions of data separately for the modeling portion of the project and they can each be read in separately to that notebook.

In [None]:
X_train.to_csv('X_train.csv')
X_test.to_csv('X_test.csv')
y_train.to_csv('y_train.csv')
y_test.to_csv('y_test.csv')
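
When these files are read back in, the saved index comes back as an ordinary column unless `index_col=0` is passed, and each target file loads as a one-column DataFrame rather than a Series. The round trip below is a sketch on made-up values written to a temporary directory, so it does not touch the real exports.

```python
import os
import tempfile
import pandas as pd

X = pd.DataFrame({'system_size': [5.0, 6.5]}, index=[10, 42])
y = pd.Series([3.1, 2.8], index=[10, 42], name='price_per_kw')

with tempfile.TemporaryDirectory() as tmp:
    X.to_csv(os.path.join(tmp, 'X_train.csv'))
    y.to_csv(os.path.join(tmp, 'y_train.csv'))

    # index_col=0 restores the original row labels
    X_back = pd.read_csv(os.path.join(tmp, 'X_train.csv'), index_col=0)
    # Select the single column to get the target back as a Series
    y_back = pd.read_csv(os.path.join(tmp, 'y_train.csv'), index_col=0)['price_per_kw']

print(X_back.index.tolist(), y_back.tolist())
```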