# Project 2: Cleaning and Transforming Data for Analysis
## For 3 data sets chosen from the week 5 discussion forum:
1. Create a .csv file
2. Use pandas to read the file and tidy it
3. Perform the analysis suggested in the discussion post

## Data Set 3: Nasdaq history (posted by Gabriel Gutierrez Garcia)
Gabriel suggested using predictive modeling to estimate future values of the index from the data set.

First, I will import the file and look at the first few rows.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

url = 'https://github.com/sarahbill33/dataacq/blob/main/%5EIXIC.csv?raw=true'
nas = pd.read_csv(url, skipinitialspace=True, index_col=0)

print(nas.head(5))

                  Open         High          Low        Close    Adj Close  \
Date                                                                         
6/21/2022  10974.04981  11164.99023  10974.04981  11069.29981  11069.29981   
6/22/2022  10941.95020  11216.76953  10938.05957  11053.08008  11053.08008   
6/23/2022  11137.67969  11260.26953  11046.28027  11232.19043  11232.19043   
6/24/2022  11351.30957  11613.23047  11337.78027  11607.62012  11607.62012   
6/27/2022  11661.01953  11677.49023  11487.07031  11524.54981  11524.54981   

               Volume  
Date                   
6/21/2022  5201450000  
6/22/2022  5215100000  
6/23/2022  5238210000  
6/24/2022  9438810000  
6/27/2022  5017930000  
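
As an aside, `read_csv` can also parse the index as dates while loading. The sketch below is only an illustration: it uses a small inline sample (`csv_text`, a hypothetical stand-in built from the first three rows printed above) instead of the real URL, and assumes the dates are in month/day/year form.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for the downloaded file
# (values copied from the first three rows printed above).
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
6/21/2022,10974.04981,11164.99023,10974.04981,11069.29981,11069.29981,5201450000
6/22/2022,10941.95020,11216.76953,10938.05957,11053.08008,11053.08008,5215100000
6/23/2022,11137.67969,11260.26953,11046.28027,11232.19043,11232.19043,5238210000
"""

# parse_dates=True asks pandas to parse the index column as dates.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(sample.index).__name__)
print(sample.index[0])
```

With a real date index, the dates sort and subtract as dates rather than as text.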


Since the index column is the date, and I will probably want to use that, I'm going to add a non-index column for the date.

In [2]:
nas['date_col']=nas.index

print(nas.head(5))

                  Open         High          Low        Close    Adj Close  \
Date                                                                         
6/21/2022  10974.04981  11164.99023  10974.04981  11069.29981  11069.29981   
6/22/2022  10941.95020  11216.76953  10938.05957  11053.08008  11053.08008   
6/23/2022  11137.67969  11260.26953  11046.28027  11232.19043  11232.19043   
6/24/2022  11351.30957  11613.23047  11337.78027  11607.62012  11607.62012   
6/27/2022  11661.01953  11677.49023  11487.07031  11524.54981  11524.54981   

               Volume   date_col  
Date                              
6/21/2022  5201450000  6/21/2022  
6/22/2022  5215100000  6/22/2022  
6/23/2022  5238210000  6/23/2022  
6/24/2022  9438810000  6/24/2022  
6/27/2022  5017930000  6/27/2022  


Next, I want to find the daily change in the closing value and add it as a column.

In [3]:
nas['daily_change'] = nas['Close'].diff()

print(nas.head(5))

                  Open         High          Low        Close    Adj Close  \
Date                                                                         
6/21/2022  10974.04981  11164.99023  10974.04981  11069.29981  11069.29981   
6/22/2022  10941.95020  11216.76953  10938.05957  11053.08008  11053.08008   
6/23/2022  11137.67969  11260.26953  11046.28027  11232.19043  11232.19043   
6/24/2022  11351.30957  11613.23047  11337.78027  11607.62012  11607.62012   
6/27/2022  11661.01953  11677.49023  11487.07031  11524.54981  11524.54981   

               Volume   date_col  daily_change  
Date                                            
6/21/2022  5201450000  6/21/2022           NaN  
6/22/2022  5215100000  6/22/2022     -16.21973  
6/23/2022  5238210000  6/23/2022     179.11035  
6/24/2022  9438810000  6/24/2022     375.42969  
6/27/2022  5017930000  6/27/2022     -83.07031  
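
To show what `diff()` is doing, here is a minimal sketch on three closing values copied from the output above. The names `close`, `change`, and `pct` are just for illustration; the percent change is an extra I added, not part of the analysis in this notebook.

```python
import pandas as pd

# Three closing values copied from the output above.
close = pd.Series(
    [11069.29981, 11053.08008, 11232.19043],
    index=["6/21/2022", "6/22/2022", "6/23/2022"],
    name="Close",
)

# diff() subtracts the previous row from each row, so the first row is NaN.
change = close.diff()

# The same thing written out by hand with shift().
by_hand = close - close.shift(1)

# Percent change from the previous day (illustrative extra).
pct = close.pct_change() * 100

print(change)
print(pct.round(3))
```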


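Actually, I'm going to make sure the new date_col is numeric first, so it can be used as the x variable in a model. Converting a datetime with `pd.to_numeric` gives the number of nanoseconds since January 1, 1970.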

In [4]:
nas['date_col'] = pd.to_datetime(nas['date_col'])
nas['date_col'] = pd.to_numeric(nas['date_col'])
nas['daily_change'] = pd.to_numeric(nas['daily_change'])

print(nas.head(5))

                  Open         High          Low        Close    Adj Close  \
Date                                                                         
6/21/2022  10974.04981  11164.99023  10974.04981  11069.29981  11069.29981   
6/22/2022  10941.95020  11216.76953  10938.05957  11053.08008  11053.08008   
6/23/2022  11137.67969  11260.26953  11046.28027  11232.19043  11232.19043   
6/24/2022  11351.30957  11613.23047  11337.78027  11607.62012  11607.62012   
6/27/2022  11661.01953  11677.49023  11487.07031  11524.54981  11524.54981   

               Volume             date_col  daily_change  
Date                                                      
6/21/2022  5201450000  1655769600000000000           NaN  
6/22/2022  5215100000  1655856000000000000     -16.21973  
6/23/2022  5238210000  1655942400000000000     179.11035  
6/24/2022  9438810000  1656028800000000000     375.42969  
6/27/2022  5017930000  1656288000000000000     -83.07031  
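
The large values in date_col are nanoseconds since the Unix epoch. The sketch below reproduces two of them from scratch. It casts to `datetime64[ns]` explicitly before converting to integers, which is my own assumption to keep the unit fixed regardless of pandas version; the notebook itself uses `pd.to_numeric`.

```python
import pandas as pd

# Two dates from the table above.
dates = pd.Series(pd.to_datetime(["6/21/2022", "6/22/2022"]))

# Force nanosecond resolution, then view the timestamps as integers.
ns = dates.astype("datetime64[ns]").astype("int64")

# One day expressed in nanoseconds: 86,400 seconds * 10**9 ns per second.
NS_PER_DAY = 86_400 * 10**9

print(ns.tolist())
print(ns.iloc[1] - ns.iloc[0] == NS_PER_DAY)
```

Consecutive calendar days are therefore 86,400,000,000,000 apart in date_col, which matters for any offset added to it later.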


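The first row has no previous close, so its daily_change is NaN. Here is the data without that row.

In [5]:
# dropna returns a new DataFrame; nas itself is not modified
nas.dropna(subset=['daily_change'])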

| Date | Open | High | Low | Close | Adj Close | Volume | date_col | daily_change |
|---|---|---|---|---|---|---|---|---|
| 6/22/2022 | 10941.95020 | 11216.76953 | 10938.05957 | 11053.08008 | 11053.08008 | 5215100000 | 1655856000000000000 | -16.21973 |
| 6/23/2022 | 11137.67969 | 11260.26953 | 11046.28027 | 11232.19043 | 11232.19043 | 5238210000 | 1655942400000000000 | 179.11035 |
| 6/24/2022 | 11351.30957 | 11613.23047 | 11337.78027 | 11607.62012 | 11607.62012 | 9438810000 | 1656028800000000000 | 375.42969 |
| 6/27/2022 | 11661.01953 | 11677.49023 | 11487.07031 | 11524.54981 | 11524.54981 | 5017930000 | 1656288000000000000 | -83.07031 |
| 6/28/2022 | 11542.24023 | 11635.84961 | 11177.67969 | 11181.54004 | 11181.54004 | 5397910000 | 1656374400000000000 | -343.00977 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9/14/2022 | 11680.41016 | 11746.83008 | 11602.75977 | 11719.67969 | 11719.67969 | 4861530000 | 1663113600000000000 | 86.10938 |
| 9/15/2022 | 11633.24023 | 11760.73047 | 11497.11035 | 11552.36035 | 11552.36035 | 4805910000 | 1663200000000000000 | -167.31934 |
| 9/16/2022 | 11401.20996 | 11460.42969 | 11316.91992 | 11448.40039 | 11448.40039 | 7451840000 | 1663286400000000000 | -103.95996 |
| 9/19/2022 | 11338.57031 | 11538.12988 | 11337.83008 | 11535.01953 | 11535.01953 | 4168670000 | 1663545600000000000 | 86.61914 |
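
One thing to keep in mind is that `dropna` returns a new DataFrame and leaves the original alone unless the result is assigned. A minimal sketch with a made-up three-row frame (`df` and `cleaned` are illustrative names, with values copied from the table above):

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the first row has no previous close to diff against.
df = pd.DataFrame(
    {
        "Close": [11069.29981, 11053.08008, 11232.19043],
        "daily_change": [np.nan, -16.21973, 179.11035],
    },
    index=["6/21/2022", "6/22/2022", "6/23/2022"],
)

# Keep only rows where daily_change is present.
cleaned = df.dropna(subset=["daily_change"])

print(len(df), len(cleaned))  # the original still has all of its rows
```

To keep the cleaned version, the result would need to be assigned to a name, as `cleaned` is here.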


Something really cool I found was using numpy to fit a linear regression model (`np.polyfit`) and then predict outcomes with it (`np.poly1d`)!

In [19]:
x = nas['date_col']
y = nas['Close']
model = np.polyfit(x,y,1)

model

array([ 1.03125691e-13, -1.59220837e+05])
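
The first coefficient is the slope in index points per nanosecond, which is why it is so small; multiplied by the 8.64e13 nanoseconds in a day it works out to roughly 8.9 points per day. As a sketch of a more readable alternative, the fit can be done on days elapsed instead of raw nanoseconds. This uses only the five rows printed earlier, so its coefficients will not match the full-data model above; the names `days`, `slope`, and `intercept` are mine.

```python
import numpy as np
import pandas as pd

# Five rows copied from the head() output earlier in the notebook.
dates = pd.to_datetime(
    ["6/21/2022", "6/22/2022", "6/23/2022", "6/24/2022", "6/27/2022"]
)
close = np.array([11069.29981, 11053.08008, 11232.19043, 11607.62012, 11524.54981])

# Days elapsed since the first observation: 0, 1, 2, 3, 6.
days = np.asarray((dates - dates[0]).days, dtype=float)

# Degree-1 fit: slope is index points per day, intercept is the fitted value on day 0.
slope, intercept = np.polyfit(days, close, 1)
predict = np.poly1d([slope, intercept])

print(slope, intercept)
print(predict(7))  # fitted value one day after the last sample row
```

Working in days keeps the x values small, so the slope can be read directly as a daily trend.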

In [24]:
predict = np.poly1d(model)
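# date_col is in nanoseconds since the epoch, so the one-year offset must be too:
# 365 days * 86,400 seconds per day * 10**9 nanoseconds per second
date_col = 1663632000000000000 + 365 * 86400 * 10**9
predict(date_col)

15594.53

## Answer: The predicted closing value of Nasdaq 1 year from now is approximately 15594.53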
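
Since date_col holds nanoseconds since the epoch, any offset added to it has to be in nanoseconds as well. Rather than typing the number by hand, the target timestamp can be built with pandas. This is a sketch, assuming 1663632000000000000 (the base value used in the prediction cell) is the intended starting point; `base`, `target`, and `target_ns` are illustrative names.

```python
import pandas as pd

# Base timestamp used in the prediction cell, in nanoseconds since the epoch.
base_ns = 1663632000000000000
base = pd.Timestamp(base_ns, unit="ns")

# Move forward 365 days and convert back to nanoseconds.
target = base + pd.Timedelta(days=365)
target_ns = target.value

NS_PER_DAY = 86_400 * 10**9

print(base.date(), target.date())
print(target_ns - base_ns == 365 * NS_PER_DAY)
```

`target_ns` can then be passed to `predict` in place of a hand-computed number.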