# Project 2 Normal Forms
Authors: Lily Geiser, Meredith Lou, Rich Pihlstrom

In this section, we walk through how we split our data into third normal form (3NF). This step involves little complexity, so the section reads mostly as an annotation of our column selection.

## Read Files

In [6]:
import pandas as pd

In [7]:
dfs = {}

files = ["companies","emissions","stocks"]
for file in files:
    with open(f"data/cleaned/{file}.csv", "r") as f:
        dfs[file] = pd.read_csv(f)


## Company name-code mapping
This is the only case where we do not keep the bulk of a single dataset together. Our companies dataset has a column for both the company name and the stock code. Because either column can identify a company on its own, keeping both in one table would make the data depend on a column other than the primary key, whichever of the two we chose as the key. We therefore create an additional dataset that maps company names to codes, and keep only one of the "name" or "code" columns alongside the other company-related columns. As this is a direct mapping, we arbitrarily choose one of the columns as the key.

### Company_map, key = [name]

In [8]:
company_map = dfs["companies"][["name","code"]]
company_map.to_csv("data/3NF/name_code_map.csv", index = False)
company_map.head()

,name,code
0,apple inc.,aapl
1,microsoft corporation,msft
2,alphabet (google),goog
3,amazon,amzn
4,nvidia corporation,nvda
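
The split above relies on the name-to-code mapping being one-to-one. A minimal sketch of how that could be checked follows; the helper `is_one_to_one` is hypothetical (it is not part of our notebook), and the small frame stands in for `dfs["companies"][["name","code"]]` using rows from the output above.

```python
import pandas as pd

# Toy stand-in for company_map; the rows are copied from the output above.
company_map = pd.DataFrame({
    "name": ["apple inc.", "microsoft corporation", "alphabet (google)"],
    "code": ["aapl", "msft", "goog"],
})


def is_one_to_one(df, left, right):
    """Hypothetical helper: True if each `left` value pairs with exactly
    one `right` value and each `right` value with exactly one `left`."""
    # Collapse repeated (left, right) pairs first, so an identical row that
    # appears twice does not count as a violation.
    pairs = df[[left, right]].drop_duplicates()
    return bool(pairs[left].is_unique and pairs[right].is_unique)


print(is_one_to_one(company_map, "name", "code"))
```

If the check fails on the full data, either column could no longer serve as the key of the map on its own.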


## Company dataset

With the map file created above, we can then drop either "name" or "code" from our company data. Given that this dataset contains company information, rather than stock information, we opted to keep the "name" column. All of the other columns describe a specific company, so we select "name" as our key.

### Companies, key = [name]

In [9]:
companies = dfs["companies"][["name","revenue_22_23_USD_e9","market_cap_USD_e12","emp_num",
                              "founded","incomeTax_22_23_USD_e9","sector","state"]]
companies.to_csv("data/3NF/companies.csv", index = False)
companies.head()

,name,revenue_22_23_USD_e9,market_cap_USD_e12,emp_num,founded,incomeTax_22_23_USD_e9,sector,state
0,apple inc.,387.53,2.52,164000,1976,18.314,consumer electronics,california
1,microsoft corporation,204.09,2.037,221000,1975,15.139,software infrastructure,washington
2,alphabet (google),282.83,1.35,190234,1998,11.356,software infrastructure,california
3,amazon,513.98,1.03,1541000,1994,-3.217,software application,washington
4,nvidia corporation,26.97,0.653,22473,1993,0.189,semiconductors,california


## Stock data

Similar to the company data, the rows of the stock data are independent for each company. However, unlike the company data, the stock data is aggregated to a yearly granularity, so a company code alone does not identify a row. The key must therefore combine the company code and the year.

### Stocks, key = [code, year]

In [10]:
stocks = dfs["stocks"]
stocks.to_csv("data/3NF/stocks.csv", index = False)
stocks.head()

,code,year,high,low,change_in_close
0,aapl,2010,11.666429,6.794643,3.876786
1,aapl,2011,15.239286,11.089286,2.693929
2,aapl,2012,25.18107,14.607143,4.319285
3,aapl,2013,20.540714,13.753571,0.428215
4,aapl,2014,29.9375,17.626785,7.840357
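
To illustrate why the key needs both columns, here is a small sketch of a composite-key check. The helper `key_violations` is hypothetical, and the frame holds only the first three `aapl` rows from the output above, not the full stocks table.

```python
import pandas as pd

# First three rows of the stocks output above (subset of columns).
stocks = pd.DataFrame({
    "code": ["aapl", "aapl", "aapl"],
    "year": [2010, 2011, 2012],
    "high": [11.666429, 15.239286, 25.18107],
})


def key_violations(df, key):
    """Hypothetical helper: rows whose `key` columns are repeated
    elsewhere in the frame or contain a missing value."""
    repeated = df.duplicated(subset=key, keep=False)
    missing = df[key].isna().any(axis=1)
    return df[repeated | missing]


# "code" alone repeats across years, so it cannot be the key.
print(len(key_violations(stocks, ["code"])))
# The (code, year) pair is unique in this sample.
print(len(key_violations(stocks, ["code", "year"])))
```

The same helper could be pointed at any of the other tables with its own key columns.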


## Emissions data

As with the stock data, the emissions data holds yearly information, here for each state rather than each company. We handle it the same way, using the state and the year together as the key.

### Emissions, key = [state, year]

In [11]:
emissions = dfs["emissions"]
emissions.to_csv("data/3NF/emissions.csv", index = False)
emissions.head()

,state,year,emissions_per_cap
0,alabama,1970,29.7
1,alaska,1970,37.3
2,arizona,1970,13.9
3,arkansas,1970,18.7
4,california,1970,14.7
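
As a rough sketch of how the four tables could be recombined through their keys, the example below chains `DataFrame.merge` calls with `validate="many_to_one"`, which raises an error if the right-hand key is not unique. The frames are toy stand-ins built from rows in the outputs above, not the full files, and the name `wide` is only illustrative.

```python
import pandas as pd

# Toy stand-ins for the four 3NF tables, using rows shown in the outputs above.
company_map = pd.DataFrame({"name": ["apple inc."], "code": ["aapl"]})
companies = pd.DataFrame({"name": ["apple inc."], "state": ["california"]})
stocks = pd.DataFrame({
    "code": ["aapl", "aapl"],
    "year": [2010, 2011],
    "high": [11.666429, 15.239286],
})
emissions = pd.DataFrame({
    "state": ["california"],
    "year": [1970],
    "emissions_per_cap": [14.7],
})

# Each merge follows one key: code -> name, name -> company columns,
# (state, year) -> emissions. Left joins keep every stock row.
wide = (
    stocks
    .merge(company_map, on="code", how="left", validate="many_to_one")
    .merge(companies, on="name", how="left", validate="many_to_one")
    .merge(emissions, on=["state", "year"], how="left", validate="many_to_one")
)

# The toy emissions frame only holds the 1970 row, so the 2010 and 2011
# stock rows find no match and emissions_per_cap is NaN in this sample.
print(wide)
```

Because every join runs from a many side to a unique key, the row count of the result matches the stocks table.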
