U.S. electricity generation by source, per month, 1950-2020 (in million kilowatt-hours)
Filtered down to renewable energy between 2010 and 2020 (the most recent data)

https://www.eia.gov/totalenergy/data/browser/index.php?tbl=T07.02A#/?f=M

In [51]:
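import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# needed by regression() below
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split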

In [52]:
# Each row title corresponds to a different type of electricity generation source
MSN_DICT = {"CLETPUS":"Coal", "PAETPUS":"Petroleum", "NGETPUS":"Natural Gas", 
          "OJETPUS":"Other Gases", "NUETPUS":"Nuclear", "HPETPUS":"Hydroelectric Pump", 
          "HVETPUS":"Hydroelectric","WDETPUS":"Wood", "WSETPUS":"Waste", 
          "GEETPUS":"Geothermal", "SOETPUS":"Solar", "WYETPUS":"Wind", "ELETPUS":"Total"}

# The sources we care about
SOURCES = ["Solar"]

In [53]:
def clean_data(df):
    ''' Function: prepares dataframe for analysis
        Parameters: dataframe
        Returns: dataframe
    '''
    # create new column with just the year (integer division drops the month digits)
    df["Year"] = df["YYYYMM"] // 100
    
    # create new column with just the month
    df["Month"] = df["YYYYMM"] % 100
    
    # translate MSN to corresponding energy source
    df["Source"] = df["MSN"].apply(lambda x: MSN_DICT[x])
    
    # take out year totals (Month = 13), and keep only 2010 onward
    df = df.loc[(df["Month"] != 13) & (df["Year"] >= 2010)]
    
    # remove unnecessary columns (columns= replaces the positional axis argument,
    # which pandas 2.0 no longer accepts)
    df = df.drop(columns=["MSN", "YYYYMM", "Unit", "Description", "Column_Order", "Month"])

    # make sure all values in Value are floats
    df["Value"] = df["Value"].astype(float)
    
    return df
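
The YYYYMM arithmetic used in clean_data can be checked in isolation. This is a minimal sketch, assuming the codes are integers such as 202106; the helper name split_yyyymm is hypothetical and not part of the notebook.

```python
def split_yyyymm(code):
    """Split an integer YYYYMM code into a (year, month) tuple."""
    # Integer division by 100 gives the year, the remainder gives the month.
    year, month = divmod(code, 100)
    return year, month

# Month 13 marks an annual total in this dataset, so those codes are skipped.
codes = [194913, 202106, 202110]
monthly = [split_yyyymm(c) for c in codes if c % 100 != 13]
print(monthly)
```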

In [55]:
def plot_sources(dfs):
    ''' Function: plots a graph for each dataframe in a list
        Parameters: list of dataframes
        Returns: prints graphs
    '''
    # pair each dataframe with its source name instead of indexing by position
    for source, df in zip(SOURCES, dfs):
        sns.regplot(x = df["Year"], y = df["Value"])
        plt.title(source+" Electricity Generation 2010-2020 (U.S.)")
        plt.ylabel("Million Kilowatt Hours")
        plt.show()

In [58]:
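def regression(df):
    ''' Function: fits a random forest regressor and prints its test MSE
        Parameters: dataframe with one column per year ('2010' to '2021')
        Returns: nothing, prints the mean squared error
    '''
    # features are the 2010-2020 columns, the target is the 2021 column
    X = df[['2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']]
    y = df['2021']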

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

    rf = RandomForestRegressor(random_state=7)
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    print('The mse of the model is: {}'.format(mse))

In [64]:
df = pd.read_csv("MER_T07_02A.csv")
clean_df = clean_data(df)
# filter the cleaned dataframe (the mask from clean_df does not align with the raw df)
solar_df = clean_df.loc[(clean_df["Source"] == "Solar")]
print(df)
#plot_sources(dfs)
#data2020 = lin_reg(dfs, 2020)
#data2030 = lin_reg(dfs, 2030)

          MSN  YYYYMM       Value  Column_Order  \
0     CLETPUS  194913   135451.32             1   
1     CLETPUS  195013  154519.994             1   
2     CLETPUS  195113  185203.657             1   
3     CLETPUS  195213  195436.666             1   
4     CLETPUS  195313  218846.325             1   
...       ...     ...         ...           ...   
8549  ELETPUS  202106   373670.86            13   
8550  ELETPUS  202107  404662.614            13   
8551  ELETPUS  202108  413949.298            13   
8552  ELETPUS  202109  348077.041            13   
8553  ELETPUS  202110  321061.685            13   

                                            Description  \
0     Electricity Net Generation From Coal, All Sectors   
1     Electricity Net Generation From Coal, All Sectors   
2     Electricity Net Generation From Coal, All Sectors   
3     Electricity Net Generation From Coal, All Sectors   
4     Electricity Net Generation From Coal, All Sectors   
...                              

Challenges with project:
The data was organized for plotting, not regression.
The data had to be reorganized to fit regression: a column for each year instead of a row for each year.
Changed from reading multiple sources to just one. Not so much a challenge as a change.
BIG CHANGE: use other data sources for the regression, not just past generation data.
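
The reshaping described above (a column for each year instead of a row for each year) could be done with a pandas pivot. This is a minimal sketch, not the notebook's actual code: it assumes the cleaned dataframe still carries a Month column (clean_data currently drops it), and it uses made-up values rather than EIA data. The year columns are converted to strings to match the '2010'-style names that regression() indexes.

```python
import pandas as pd

# Made-up long-format data: one row per (Year, Month). Not real EIA values.
long_df = pd.DataFrame({
    "Year":  [2010, 2010, 2011, 2011],
    "Month": [1, 2, 1, 2],
    "Value": [10.0, 20.0, 30.0, 40.0],
})

# One row per month, one column per year.
wide_df = long_df.pivot(index="Month", columns="Year", values="Value")

# regression() selects columns by strings such as '2010'.
wide_df.columns = wide_df.columns.astype(str)
wide_df.columns.name = None

print(wide_df)
```

With this layout a single source yields at most 12 rows (one per month), which is very little data for a train/test split.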

Project background:
I wanted to build upon a small project in a subject I'm interested in (renewables data). The original used a basic concept in linear regression; this version uses a machine learning regression algorithm instead, which I learned in a later DS class.