# Getting the most solar power for your dollar
## Preprocessing and feature engineering
### Zachary Brown

The data has been cleaned and preliminary analysis has identified some trends we should expect to see the eventual model pick up on. Now I'm going to preprocess the data so that any models I work with can use the data appropriately. This will include imputing missing data, feature engineering, scaling, and splitting the data into testing and training datasets.

I'll start by loading the necessary packages and reading in the data from the exploratory data analysis portion of the project.

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme('notebook')
import scipy.stats
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_regression

In [2]:
print(os.getcwd())
# Build the relative path from parts so it also works outside Windows
os.chdir(os.path.join('..', 'data', 'processed'))
print(os.getcwd())

C:\Users\Zjbro\OneDrive\Documents\GitHub\Solar-Panel-Capstone\notebooks
C:\Users\Zjbro\OneDrive\Documents\GitHub\Solar-Panel-Capstone\data\processed


In [3]:
data = pd.read_csv('processed_data.csv', index_col=0, na_values = [-1, '-1'], low_memory=False)
data.shape

(208257, 57)
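To illustrate what `na_values` does in the call above, here is a minimal sketch on an invented three-row CSV (the values are made up for the example; only the column names echo the real data). The -1 placeholders are read in as NaN rather than as numbers.

```python
import io
import pandas as pd

# Invented toy CSV: -1 stands in for "missing", as in the real file
csv_text = "id,tilt_1,azimuth_1\n0,20,180\n1,-1,270\n2,15,-1\n"

toy = pd.read_csv(io.StringIO(csv_text), index_col=0, na_values=[-1, '-1'])

# Each column had exactly one -1 placeholder, which is now NaN
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Because the placeholders become NaN at load time, the later missing-value checks and imputation treat them like any other gap.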

In [4]:
data.columns.groupby(data.dtypes)

{int64: ['expansion_system', 'multiple_phase_system', 'year', 'month', 'state_AZ', 'state_CA', 'state_CO', 'state_CT', 'state_DE', 'state_FL', 'state_MA', 'state_MD', 'state_MN', 'state_NH', 'state_NM', 'state_NY', 'state_RI', 'state_TX', 'state_UT', 'state_WI'], float64: ['system_size_dc', 'total_installed_price', 'rebate_or_grant', 'tracking', 'ground_mounted', 'third_party_owned', 'self_installed', 'azimuth_1', 'tilt_1', 'module_quantity_1', 'additional_modules', 'bipv_module_1', 'bifacial_module_1', 'nameplate_capacity_module_1', 'efficiency_module_1', 'inverter_quantity_1', 'additional_inverters', 'micro_inverter_1', 'solar_storage_hybrid_inverter_1', 'built_in_meter_inverter_1', 'dc_optimizer', 'inverter_loading_ratio', 'price_per_kw'], object: ['data_provider_1', 'system_id_1', 'installation_date', 'customer_segment', 'zip_code', 'city', 'utility_service_territory', 'installer_name', 'module_manufacturer_1', 'module_model_1', 'technology_module_1', 'inverter_manufacturer_1', 'in

I'm going to first remove total installed price because it's not something that's really in the user's control. They may be able to shop around for a few quotes, but in reality the goal of the project is to determine what other levers can be pulled to reduce that price as much as possible. Next I'll remove city since the goal of my project is just to analyze cost efficiency at the state level. Zip code may give some interesting insights on a more granular level, but I don't see additional value in keeping city.

In [5]:
data=data.drop(columns=['total_installed_price', 'city'])

I want to check each column for the percentage of values that are missing, and remove any features with more than 30% missing values.

In [6]:
percent_missing = data.isnull().sum()/len(data)*100
percent_missing.sort_values(ascending=False)

date_of_battery_install            93.650634
ground_mounted                     27.698949
azimuth_1                          22.536097
tilt_1                             22.512569
tracking                           21.979573
inverter_loading_ratio             21.270353
additional_modules                 18.227479
additional_inverters               18.227479
solar_storage_hybrid_inverter_1    17.041444
inverter_quantity_1                14.037463
module_quantity_1                  14.007212
efficiency_module_1                12.028887
built_in_meter_inverter_1          11.630341
micro_inverter_1                   11.630341
inverter_manufacturer_1            11.622178
inverter_model_1                   11.621698
bipv_module_1                      11.000351
dc_optimizer                       10.992668
nameplate_capacity_module_1        10.799637
technology_module_1                10.606126
bifacial_module_1                  10.606126
module_model_1                     10.605646
module_man

The only feature that needs to be removed due to null values is 'date_of_battery_install'. 

In [7]:
data=data.drop(columns=['date_of_battery_install'])
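The 30% rule could also be applied programmatically instead of reading the column name off the sorted output. The sketch below does that on an invented toy frame; the `threshold` and `to_drop` names are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Invented toy frame: 'mostly_missing' is 75% NaN, 'partly_missing' is 25% NaN
toy = pd.DataFrame({
    'complete': [1.0, 2.0, 3.0, 4.0],
    'partly_missing': [1.0, np.nan, 3.0, 4.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 4.0],
})

threshold = 30  # percent of missing values above which a column is dropped
percent_missing = toy.isnull().mean() * 100
to_drop = percent_missing[percent_missing > threshold].index.tolist()

trimmed = toy.drop(columns=to_drop)
print(to_drop)
```

Selecting the columns from the computed percentages keeps the cutoff in one place if it ever needs to change.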

I'm going to check how many zip codes are in the dataset, convert them from 9 digit to 5, and then check to see how many times the top five unique zip codes are present in the dataset.

In [8]:
data['zip_code'].nunique()

7602

In [9]:
data['zip_code'] = data['zip_code'].str[0:5]

In [10]:
print(data['zip_code'].nunique())
data['zip_code'].value_counts().head()

4113


92584    1141
92058    1128
92336    1047
93727     982
95762     908
Name: zip_code, dtype: int64
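As a quick check of how the slice behaves, the sketch below applies `.str[0:5]` to a few invented zip codes. The exact 9-digit format in the real data is not shown above, so both a hyphenated and an unhyphenated form are included here as assumptions.

```python
import pandas as pd

# Invented zip codes: hyphenated ZIP+4, plain 5-digit, unhyphenated 9-digit, and missing
zips = pd.Series(['92584-1234', '92584', '925841234', None])

# .str[0:5] keeps the first five characters and leaves missing values as missing
short = zips.str[0:5]
print(short.nunique())
```

All three non-missing forms collapse to the same 5-digit code, which is why the count of unique zip codes drops after the conversion.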

So it looks like removing the last four digits cut the number of unique zip codes from 7,602 to 4,113, but even so the most common zip code accounts for only about 1,100 of the roughly 208,000 entries in the dataset. I'll check how many zip codes have more than 30 entries and then decide on a threshold for which should get dummy columns.

In [11]:
print((data['zip_code'].value_counts() > 30).sum())

1287


Right now I have 54 features in my data. A general rule of thumb is to keep the number of features at or below the square root of the number of entries in the data. For this dataset that means I should stick to 456 or fewer features. With that said, I will go ahead and create dummy columns for each of these zip codes, and later on I'll use univariate feature selection to retain only the 400 features with the strongest relationship to the target, as scored by a univariate F-test (f_regression).
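The 456 figure comes straight from the row count reported by `data.shape`. A short worked version of the arithmetic:

```python
import math

n_rows = 208257  # row count reported by data.shape

# Integer part of the square root gives the rule-of-thumb feature cap
max_features = math.isqrt(n_rows)
print(max_features)
```

Keeping 400 features leaves some headroom under that cap.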

In [12]:
data.shape

(208257, 54)

In [13]:
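data['zip_other'] = 0
small_zips = []

# Count each zip code once up front rather than on every loop iteration
zip_counts = data['zip_code'].value_counts(dropna=False)

for zipcode in data['zip_code'].unique():
    if zip_counts[zipcode] > 30:
        # Common zip code: add a 0/1 indicator column for it
        new_col = pd.Series(((data['zip_code'] == zipcode)*1), name = f'zip_{zipcode}')
        data = pd.concat([data, new_col.to_frame()], axis=1)
    else:
        small_zips.append(zipcode)

# Zip codes with 30 or fewer entries share the single 'zip_other' indicator
data.loc[data['zip_code'].isin(small_zips), 'zip_other'] = 1
data = data.drop(columns=['zip_code'])
data.shape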

(208257, 1342)
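The same "dummy the common values, pool the rest" idea can be written without a Python loop. The sketch below is one possible vectorized version; `dummy_encode_with_other` is a hypothetical helper, not a pandas function, and it differs from the cell above in two ways: it prefixes new columns with the full column name, and it never creates a column for missing values.

```python
import pandas as pd

def dummy_encode_with_other(df, col, min_count=30):
    """Hypothetical helper: one 0/1 column per value seen more than min_count
    times, plus a single '<col>_other' column for the remaining non-missing values."""
    counts = df[col].value_counts()
    common = counts[counts > min_count].index
    # Values that are not common become NaN here, so get_dummies skips them
    dummies = pd.get_dummies(df[col].where(df[col].isin(common)), prefix=col, dtype=int)
    other = (df[col].notna() & ~df[col].isin(common)).astype(int).rename(f'{col}_other')
    return pd.concat([df.drop(columns=[col]), dummies, other], axis=1)

# Invented toy data, with the threshold lowered to 1 so the example stays small
toy = pd.DataFrame({'zip_code': ['92584'] * 3 + ['92058'] * 2 + ['10001'],
                    'size': range(6)})
encoded = dummy_encode_with_other(toy, 'zip_code', min_count=1)
print(encoded.columns.tolist())
```

Building all the indicator columns first and concatenating once also avoids growing the dataframe one column at a time.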

Next I want to browse the object columns and count how many unique values each has. A feature with only one unique value, or with a nearly unique value on every row, won't help identify any trends.

In [14]:
for col in data.columns:
    if data[col].dtypes == 'object':
        print(col, ' : ', data[col].nunique())

data_provider_1  :  22
system_id_1  :  200022
installation_date  :  530
customer_segment  :  1
utility_service_territory  :  71
installer_name  :  2640
module_manufacturer_1  :  156
module_model_1  :  2459
technology_module_1  :  6
inverter_manufacturer_1  :  63
inverter_model_1  :  628
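The same count can be produced in one expression by selecting the non-numeric columns first. This is a minimal sketch on invented toy data (the 'RES' label and the ids are made up), not a replacement for the loop above.

```python
import pandas as pd

toy = pd.DataFrame({
    'customer_segment': ['RES', 'RES', 'RES'],   # a single unique value
    'system_id_1': ['a1', 'b2', 'c3'],           # unique on every row
    'system_size_dc': [5.1, 6.3, 4.8],           # numeric, so excluded below
})

# Keep only non-numeric columns, then count unique values in each
unique_counts = toy.select_dtypes(exclude='number').nunique().sort_values()
print(unique_counts)
```

Sorting the result puts the two problem cases, one unique value and one per row, at opposite ends of the list.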


Based on these results it should be safe to remove system_id_1, as that has a unique value for almost every entry. I'll also drop customer_segment since earlier in the project I limited the dataset to only residential installations. 

In [15]:
data = data.drop(columns=['system_id_1', 'customer_segment'])

Great! Now I need to encode these categorical features as I did with the zip codes earlier. I'll count the occurrences of each value in each column: any value that appears in more than 30 entries gets its own dummy column, and anything with 30 or fewer gets lumped into an 'other' column. Since technology_module_1 and data_provider_1 have fewer than 30 unique values, I'll just dummy encode those entire columns.

In [16]:
data = pd.get_dummies(data, columns=['technology_module_1', 'data_provider_1'])
data.shape

(208257, 1366)

In [17]:
cols = ['installation_date', 'utility_service_territory', 'installer_name', 'module_manufacturer_1',
        'module_model_1', 'inverter_manufacturer_1', 'inverter_model_1']
for col in cols:
    data[f'{col}_other'] = 0
    small_vals = []
    # Count each value once per column rather than on every inner iteration
    val_counts = data[col].value_counts(dropna=False)

    for val in data[col].unique():
        if val_counts[val] > 30:
            # Common value: add a 0/1 indicator column for it
            new_col = pd.Series(((data[col] == val)*1), name = f'{col}_{val}')
            data = pd.concat([data, new_col.to_frame()], axis=1)
        else:
            small_vals.append(val)

    # Values with 30 or fewer entries share the column's 'other' indicator
    data.loc[data[col].isin(small_vals), f'{col}_other'] = 1
    data = data.drop(columns=[col])
dummied = data.copy()
dummied.shape


(208257, 2908)

In [18]:
dummied['price_per_kw'].head()

108019    3036.156250
108020    2586.400000
108142    4015.238095
108175    2788.732394
108233    3500.000000
Name: price_per_kw, dtype: float64

Now that I have dummy columns for all categorical values with more than 30 entries, I'm going to impute any missing values with the most common value for that feature. Later on, once I've done initial modeling, I can take a closer look at the most important features and consider whether the imputed values should be changed.

In [19]:
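from sklearn.impute import SimpleImputer

mode_imputer = SimpleImputer(strategy='most_frequent')

# Impute only the columns that actually contain missing values
for col in dummied.columns:
    if dummied[col].isna().sum() != 0:
        # fit_transform returns an (n, 1) array; ravel flattens it back to one column
        dummied[col] = mode_imputer.fit_transform(dummied[col].values.reshape(-1,1)).ravel()

imputed = dummied.copy()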


In [20]:
imputed.shape

(208257, 2908)
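For comparison, most-frequent imputation can also be done with pandas alone. The sketch below uses an invented toy frame and is an alternative to the `SimpleImputer` loop, not what the notebook ran; where a column has tied modes, `mode().iloc[0]` takes the smallest one.

```python
import numpy as np
import pandas as pd

# Invented toy frame with one gap in each of the first two columns
toy = pd.DataFrame({
    'tracking': [0.0, 0.0, 1.0, np.nan],
    'tilt_1': [20.0, np.nan, 20.0, 15.0],
    'year': [2018, 2019, 2019, 2020],
})

# mode() returns the most frequent value(s) per column; row 0 holds the first mode
column_modes = toy.mode().iloc[0]
filled = toy.fillna(column_modes)
print(filled.isna().sum().sum())
```

Filling every column in one call avoids refitting an imputer once per column.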

With missing values imputed, I'm going to trim the dataframe down to the 400 features with the highest univariate F-scores, using f_regression to score each feature against price_per_kw.

In [22]:
x = imputed.drop(columns=['price_per_kw'])
y = imputed['price_per_kw']
selector = SelectKBest(f_regression, k=400)
transformed = selector.fit_transform(x, y)

MemoryError: Unable to allocate 4.51 GiB for an array with shape (2907, 208257) and data type float64
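One possible way around the memory error is to hand `SelectKBest` a sparse matrix, since most of the 2,907 columns are 0/1 dummies that are almost entirely zero and `f_regression` accepts sparse input. The sketch below shows the idea on invented toy data. `to_sparse_in_chunks` is a hypothetical helper, and this has not been run against the full dataset, so whether it fits in memory there is an assumption.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_selection import SelectKBest, f_regression

def to_sparse_in_chunks(df, chunk_rows=50_000):
    """Hypothetical helper: convert a frame to CSR one block of rows at a time,
    so only one dense block is held in memory at once."""
    blocks = [
        sparse.csr_matrix(df.iloc[start:start + chunk_rows].to_numpy(dtype=np.float64))
        for start in range(0, len(df), chunk_rows)
    ]
    return sparse.vstack(blocks, format='csr')

# Invented toy data: 20 mostly-zero dummy columns plus one informative feature
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame((rng.random((n, 20)) < 0.05).astype(float),
                   columns=[f'dummy_{i}' for i in range(20)])
toy['signal'] = rng.normal(size=n)
target = 3.0 * toy['signal'] + rng.normal(scale=0.1, size=n)

X_sparse = to_sparse_in_chunks(toy, chunk_rows=300)
selector = SelectKBest(f_regression, k=5).fit(X_sparse, target)

# get_support lives on the fitted selector, not on the transformed array
kept = toy.columns[selector.get_support()]
print(kept.tolist())
```

A sparse matrix stores only the nonzero entries, so its footprint scales with the number of ones in the dummy columns rather than with rows times columns.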

In [None]:
# get_support is a method of the fitted selector; transformed is a plain NumPy array
features = selector.get_support(indices=True)
selected = x.iloc[:,features]
selected.shape

In [None]:
# price_per_kw was dropped from x, so the target is in y rather than in selected
y.head()

Now that the dataframe has had categorical values dummied, missing values imputed, and then been pared down to the 400 most valuable features, I'll perform the train test split required for modeling.

In [None]:
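# random_state fixes the shuffle so the split is reproducible; 42 is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(selected, y, test_size=0.25, random_state=42)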

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Now I'll do one last check of the training and testing features to confirm that there are no more null values or -1s present. After that we should be ready to start modeling.

In [None]:
# Total null values in dataframe
print(X_train.isna().sum().sum())
print(X_test.isna().sum().sum())

In [None]:
# Total -1 values in dataframe
print((X_train.values == -1).sum())
print((X_test.values == -1).sum())

Great! Our data looks good to go. I'll export all four portions of data separately for the modeling portion of the project, so each can be read into that notebook on its own.

In [None]:
X_train.to_csv('X_train.csv')
X_test.to_csv('X_test.csv')
y_train.to_csv('y_train.csv')
y_test.to_csv('y_test.csv')
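When these files are read back in the modeling notebook, the saved index needs to be restored and the one-column target files turned back into Series. A minimal round-trip sketch on invented toy data, written to a temporary directory so it does not touch the real exports:

```python
import os
import tempfile
import pandas as pd

# Invented stand-ins for X_train and y_train, keeping non-default row labels
X_toy = pd.DataFrame({'system_size_dc': [5.1, 6.3]}, index=[108019, 108020])
y_toy = pd.Series([3036.15625, 2586.4], index=[108019, 108020], name='price_per_kw')

with tempfile.TemporaryDirectory() as tmp:
    X_toy.to_csv(os.path.join(tmp, 'X_train.csv'))
    y_toy.to_csv(os.path.join(tmp, 'y_train.csv'))

    # index_col=0 restores the original row labels instead of adding a new column
    X_back = pd.read_csv(os.path.join(tmp, 'X_train.csv'), index_col=0)
    # squeeze turns the one-column frame back into a Series
    y_back = pd.read_csv(os.path.join(tmp, 'y_train.csv'), index_col=0).squeeze('columns')

print(y_back.name)
```

Without `index_col=0` the saved row labels would come back as an extra unnamed column and end up among the features.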