# Reading Data

Kaggle notebooks start with a default cell that imports the numpy and pandas libraries. The same cell, shown below, walks the input directory and prints the path of every file that you can use as input data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Each input file has a path that we can pass to pandas to import the data table. Pandas has built-in functions for reading .csv and .xlsx files: pd.read_csv and pd.read_excel.

In [None]:
population=pd.read_csv('/kaggle/input/population-by-country-2020/population_by_country_2020.csv')
reactors=pd.read_excel('/kaggle/input/reactors/usreact17.xlsx')
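The same read_csv call can be tried without any Kaggle input attached. The sketch below is only an illustration: the CSV text and the column names in it are made up, and io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV contents standing in for a file under /kaggle/input
csv_text = "Country,Population\nA,100\nB,250\n"

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# The first line of the text becomes the column headers
print(sample.shape)
print(list(sample.columns))
```
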

In [None]:
population.head()
#[:8]

Population is a standard dataframe with many different attributes for a given country, with each attribute in a new column.

In [None]:
reactors.head()

This dataframe is not cleanly formatted. Let's look at some more rows to see how it was read in.

In [None]:
#this slice returns rows 0-7 ([8:] would return row 8 and everything after it)
reactors[:8]

The structure of reactors is a bit more complicated. In Excel this will look normal, but the dataframe does not present it as cleanly because all cells are the same size and blank cells are filled with 'NaN'. Rows 0-3 hold no data, just a couple of descriptions. Knowing that the units of energy are megawatt hours, we can get rid of those rows and get a much cleaner dataframe.

In [None]:
reactors[4:]

This is still misleading because the actual column names are stored in the first remaining row (row 4), not in the headers.

# Fixing the Data

In [None]:
#get rid of empty rows
df=reactors[4:]

#iterate over every column
for column in df:
    
    #rename each column as the name provided in row 4
    df = df.rename(columns = {column:df[column][4]}) 
df

Great- now we have the proper headings. Let's get rid of that first row now because it no longer has useful data.

In [None]:
#can get rid of original row that had column headers
df=df[1:]
df.head()
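The rename loop and the row drop can also be done in one pass by assigning the header row's values to the columns directly. This is a sketch on a made-up frame, not the reactors data. The values, the column names, and the header_row position are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical frame: row 0 is a description, row 1 holds the real column names
raw = pd.DataFrame([
    ["Units: MWh", None, None],
    ["Plant Name", "State", "January"],
    ["Plant A", "AL", 10],
    ["Plant B", "AL", 20],
])

header_row = 1  # position of the row that holds the column names

# Keep only the rows after the header row
clean = raw.iloc[header_row + 1:].copy()

# Use the header row's values as the column names
clean.columns = raw.iloc[header_row].tolist()

# Relabel the remaining rows 0, 1, ...
clean = clean.reset_index(drop=True)

print(clean)
```
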

The method .value_counts() can be very useful. I like to use it whenever I'm analyzing a new column of data so I understand how many different items I'm looking at (unless it's a measurement), and how they are distributed.

In [None]:
df.State.value_counts()
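To see what value_counts() returns, here is a minimal sketch on a made-up Series (the state labels are placeholders, not the real data). The result is a Series whose index holds the unique values and whose values are the counts, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical 'State' column with a repeated label and a 'Total' entry
states = pd.Series(["AL", "AL", "Alabama Total", "IL", "IL", "IL"])

counts = states.value_counts()

# The index holds the unique entries, the values hold how often each appears
print(counts)
```
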

Notice that there is a new row for a state's total output from all of its nuclear reactors. This can be helpful, but it may cause problems if we want to analyze by state- let's remove the 'total' rows so that we only have the individual states' reactors.

We need to go through all of the entries in the 'State' column and remove every row whose entry contains the phrase 'Total'.

In [None]:
#give df a new name so that we can distinguish between the dataframes with and
#without state totals
short_df=df

In [None]:
#df.State.value_counts() will give us a series of state names and the number of
#times they show up in the 'State' column.

#Adding the .index will give a list-type structure such that we can iterate through
#every state
df.State.value_counts().index

In [None]:
#each entry is a string
type(df.State.value_counts().index[0])

In [None]:
#iterate through every unique string in the 'State' column
for string in df.State.value_counts().index:
    
    #check whether the phrase 'Total' is in the entry (this evaluates to True or False)
    if 'Total' in string:
        
        #boolean Series: True where the 'State' column equals this 'Total' entry
        is_total=short_df['State'].isin([string])
        
        #keep only the rows marked False; ~ inverts a boolean Series
        #(unary - raises a TypeError on boolean data in current pandas)
        short_df=short_df[~is_total]
short_df

In [None]:
#check work
short_df.State.value_counts()
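The loop above can be collapsed into a single boolean mask with the .str.contains() string method. The sketch below uses a made-up frame, and its labels and numbers are placeholders. Passing na=False treats missing entries as non-matches.

```python
import pandas as pd

# Hypothetical data mixing reactor rows and state-total rows
toy = pd.DataFrame({
    "State": ["AL", "AL", "Alabama Total", "IL", "Illinois Total"],
    "January": [10, 20, 30, 5, 5],
})

# True for every row whose 'State' entry contains the phrase 'Total'
is_total = toy["State"].str.contains("Total", na=False)

# ~ inverts the mask, so only the non-total rows are kept
reactors_only = toy[~is_total]

print(reactors_only)
```
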

# Analyzing the Data- the first steps

Compute a state's totals for the month of January and compare to the provided totals in the original dataframe.

In [None]:
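alabama=short_df[short_df.State.isin(['AL'])].reset_index()

#add up the January values one row at a time
#(named total so that Python's built-in sum() is not overwritten)
total=0
for row in range(len(alabama)):
    total+=alabama['January'][row]
total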

In [None]:
#simpler method: select the column first, then sum only that column
alabama['January'].sum()

**Important note**

When creating the dataframe of Alabama's data, I added reset_index() at the end. This is not always necessary when creating a new dataframe. I need it here because without it the loop raises KeyError: 0. A filtered dataframe keeps the index labels of the original data, which means that the alabama dataframe starts with a non-zero index, as seen below.

In [None]:
short_df[short_df.State.isin(['AL'])]

Indices here are in order, but they will not always be consecutive, depending on where the rows sat in the original dataset. This is problematic because when I iterate through the rows by calling range(len(alabama)), I need a zero-based index that goes in order. len gives the length of the alabama dataframe (number of rows), while range indicates that I intend to iterate over that number of rows, starting from 0. Thus, reset_index() reindexes the dataframe starting from zero.

In [None]:
short_df[short_df.State.isin(['AL'])].reset_index()
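The KeyError described above can be reproduced on a small made-up frame. The values are placeholders. The sketch also shows two ways around it: position-based access with .iloc, and reset_index(drop=True), which relabels the rows without keeping the old labels as an extra 'index' column.

```python
import pandas as pd

# Hypothetical data: the 'AL' rows sit at labels 1 and 2
toy = pd.DataFrame({"State": ["IL", "AL", "AL"], "January": [5, 10, 20]})

# Filtering keeps the original index labels
subset = toy[toy["State"] == "AL"]
print(list(subset.index))

# Looking up label 0 fails because no remaining row carries that label
try:
    subset["January"][0]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# .iloc selects by position, so it works regardless of the labels
first_by_position = subset["January"].iloc[0]

# reset_index(drop=True) relabels the rows 0, 1, ... and discards the old labels
relabeled = subset.reset_index(drop=True)
print(list(relabeled.index))
```
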

In [None]:
df[df['State'].isin(['Alabama Total'])]['January']

The two totals are indeed the same. This dataset was imported from an Excel sheet, so I'm sure that total was computed by a simple =SUM() function in Excel. Though this was a simple example, it's important to know the mechanics of how to compute a total so that it can be performed on specific groups.

Let's do a more specific example. Say we want to compute the yearly output for all the states in the midwest (just consider the midwest as Minnesota, Illinois, Indiana, Wisconsin, and Ohio for now).

In [None]:
#this takes two steps

#first keep only the rows for the desired states, then add up their year-to-date output
midwest=df[df['State'].isin(['MN','IL','IN','WI','OH'])].reset_index()
total=0
for row in range(len(midwest)):
    total+=midwest['Year_to_Date'][row]
total

In [None]:
#simpler method: select the column first, then sum only that column
midwest['Year_to_Date'].sum()

In [None]:
midwest.sum()

In [None]:
#add sun belt as a region for comparison
sunbelt=df[df['State'].isin(['FL','GA','SC','AL','MS'])].reset_index()

Another common technique is to group certain data into a new dataframe. If we want to break down the data by region, it may be helpful to create a new dataframe of regional data.

In [None]:
#data for summer months
raw_data={'Month':['June','June','July','July','August','August'],
         'Region':['Midwest','Sun Belt','Midwest','Sun Belt','Midwest','Sun Belt'],
         'Total':[midwest.sum()['June'],sunbelt.sum()['June'],midwest.sum()['July'],sunbelt.sum()['July'],
                 midwest.sum()['August'],sunbelt.sum()['August']]}
summer=pd.DataFrame(raw_data, columns=['Month','Region','Total'])

summer
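A regional table like summer can also be built without typing each total by hand, by mapping states to regions and grouping. The following is a sketch on made-up numbers. The region_of dictionary and the output values are assumptions for illustration, not the real reactor data.

```python
import pandas as pd

# Hypothetical monthly output per state
toy = pd.DataFrame({
    "State": ["IL", "OH", "FL", "GA"],
    "June": [10, 20, 30, 40],
    "July": [11, 21, 31, 41],
})

# Hypothetical state-to-region lookup
region_of = {"IL": "Midwest", "OH": "Midwest", "FL": "Sun Belt", "GA": "Sun Belt"}
toy["Region"] = toy["State"].map(region_of)

# One row per region, one column per month
regional = toy.groupby("Region")[["June", "July"]].sum()

# Reshape to one row per (Region, Month) pair, like the summer dataframe
long_form = regional.reset_index().melt(
    id_vars="Region", var_name="Month", value_name="Total"
)

print(long_form)
```
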

# Visualization- stacked bar plot

In [None]:
import matplotlib.pyplot as plt

For the first bar plot we'll organize the x axis by month and stack each region's total output for each month.

In [None]:
#easiest way for me to do stacked bar plot is to add each level one by one
#for each level, specify x axis and y axis data

p1=plt.bar(summer[summer['Region'].isin(['Midwest'])]['Month'], summer[summer['Region'].isin(['Midwest'])]['Total'])
p2=plt.bar(summer[summer['Region'].isin(['Sun Belt'])]['Month'], summer[summer['Region'].isin(['Sun Belt'])]['Total'],
       bottom=summer[summer['Region'].isin(['Midwest'])]['Total'])
plt.ylabel('Energy Output (MegaWatt Hours)')
plt.title('Energy Output for Summer Months')
plt.legend((p1[0], p2[0]), ('Midwest', 'Sun Belt'))
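The bottom argument for each layer is the running total of the layers beneath it. One way to compute those offsets for any number of layers is to pivot the long table into a wide one and take a cumulative sum. This is a sketch on made-up totals, and the toy_summer frame is hypothetical.

```python
import pandas as pd

# Hypothetical long-format totals, shaped like the summer dataframe
toy_summer = pd.DataFrame({
    "Month": ["June", "June", "July", "July"],
    "Region": ["Midwest", "Sun Belt", "Midwest", "Sun Belt"],
    "Total": [30, 70, 32, 72],
})

# One row per month, one column per region
wide = toy_summer.pivot(index="Month", columns="Region", values="Total")

# pivot sorts the months alphabetically, so restore calendar order
wide = wide.loc[["June", "July"]]

# Offset of each layer: cumulative sum across regions minus the layer itself
bottoms = wide.cumsum(axis=1) - wide

print(bottoms)
```

With matplotlib installed, calling wide.plot(kind="bar", stacked=True) on a table of this shape draws the stacked bars and handles the offsets internally.
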

The next bar graph will be the opposite: the regions go on the x axis and their monthly outputs are stacked.

In [None]:
#these variables are created beforehand so that the plotting code is a bit easier to interpret
#we want y axis data to be specified by month
#this will by default have different data for each region

june_totals=summer[summer['Month'].isin(['June'])]['Total']
july_totals=summer[summer['Month'].isin(['July'])]['Total']
august_totals=summer[summer['Month'].isin(['August'])]['Total']

In [None]:
#on the x axis we want each region, but plotted by month
    #summer[summer['Month'].isin(['June'])]['Region'] 
    #the above code isolates all June data from the dataframe, then adding the ['Region'] column we specify that
    #we want the x axis to be separated by region

p1=plt.bar(summer[summer['Month'].isin(['June'])]['Region'], june_totals)
p2=plt.bar(summer[summer['Month'].isin(['July'])]['Region'], july_totals,
       bottom=june_totals)

#june_totals + july_totals gives NaN because pandas aligns the two Series on their index labels,
#and the June and July rows carry different labels; adding .values sums by position instead
p3=plt.bar(summer[summer['Month'].isin(['August'])]['Region'], august_totals,
          bottom=june_totals.values + july_totals.values)
           
plt.ylabel('Energy Output (MegaWatt Hours)')
plt.legend((p1[0], p2[0],p3[0]), ('June', 'July','August'))
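The need for .values comes from index alignment: when two Series are added, pandas matches rows by label rather than by position. The sketch below uses made-up numbers, with labels chosen to mimic rows taken from different parts of one dataframe.

```python
import pandas as pd

# Hypothetical monthly totals carrying the labels of their original rows
june = pd.Series([30, 70], index=[0, 1])
july = pd.Series([32, 72], index=[2, 3])

# Label-based addition: no label appears in both Series, so every result is NaN
aligned_sum = june + july
print(aligned_sum)

# Position-based addition on the underlying arrays
positional_sum = june.values + july.values
print(positional_sum)
```
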

# Visualization- stacked bar plot by reactor

Another idea is to look at the outputs of the different reactors in a given state.

In [None]:
il=short_df[short_df['State'].isin(['IL'])]
il

In [None]:
il_short=il[['Plant ID','Plant Name','January','February','March']]
p1=plt.bar(il_short['Plant Name'],il_short['January'])
p2=plt.bar(il_short['Plant Name'],il_short['February'],bottom=il_short['January'])

plt.xticks(rotation=90)
