
## Web scraping and IO

In [None]:
import pandas as pd

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = pd.read_html('https://www.macrotrends.net/1319/dow-jones-100-year-historical-chart')
df = data[0]
df1 = data[1]

In [None]:
len(data)

In [None]:
df  # The general set of data

In [None]:
data[2]  # Useless data

In [None]:
df1  # Also useless

- The code above was only meant to show what the scraped data looks like. `pd.read_html` returned three tables, but the second and third contain nothing important. Both appear to hold only the page title and the website's name, so the analysis uses only the first data frame, `df`.
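A quicker way to do the same inspection is to loop over the list and print each table's shape and column names. This is only a minimal sketch: `tables` is a hypothetical stand-in for the list that `pd.read_html` returns, the values are made up, and picking the table with the most rows is a heuristic that is assumed to fit this page, not a general rule.

```python
import pandas as pd

# Hypothetical stand-ins for the list that pd.read_html returns (made-up values).
tables = [
    pd.DataFrame({'Year': [2001, 2002, 2003], 'Close': [1.0, 2.0, 3.0]}),
    pd.DataFrame({'Site': ['Example']}),
    pd.DataFrame({'Site': ['Example']}),
]

# Print the position, shape, and column names of every table.
for position, table in enumerate(tables):
    print(position, table.shape, list(table.columns))

# One simple heuristic: keep the table with the most rows.
main_table = max(tables, key=len)
```

With the real scrape, the same loop would run over `data` instead of `tables`.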

In [None]:
df.columns  # Shows the names for all the columns in the dataframe

In [None]:
columns = list(df.columns)

In [None]:
print(columns[6])

In [None]:
print(columns[4])

In [None]:
df['Dow Jones Industrial Average - Historical Annual Data', 'Annual % Change']

In [None]:
df['Dow Jones Industrial Average - Historical Annual Data', 'Annual % Change'].min()  # Will output incorrect value

In [None]:
df.describe()  # Does not contain Annual % Change in description

In [None]:
percent_change = df['Dow Jones Industrial Average - Historical Annual Data', 'Annual % Change'].str.replace('%','').astype(float)
percent_change

In [None]:
percent_change.min()

In [None]:
percent_change.max()

In [None]:
percent_change.mean()

In [None]:
df.describe()

In [None]:
df['Dow Jones Industrial Average - Historical Annual Data', 'Annual % Change'] = percent_change

In [None]:
df.describe()

- This last chunk of code does a lot, so I will break it down. To edit or analyze a column or row of the data frame, it has to be referenced by name in the code. In the output of `df`, every column name contains spaces, and the column of interest was Annual % Change. Attribute access along the lines of `df.Annual % Change` fails because the name contains spaces and a % sign. `df.columns` shows the column names, but that ran into a problem too: the output did not show the full names and substituted **...** once the character limit was reached.
- I then converted the column names into a list called `columns` and picked the column manually by printing the 7th value (`columns[6]`), which is the Annual % Change column. That prints the full two-part name, which finally makes it possible to select just that column. There were a couple of problems, however. The % signs in the column made the values strings, so `min` returned an incorrect value and `describe` left the column out entirely.
- To fix those issues, the % sign was removed from the values and the values were converted to floats. That gave the correct `min` and `max` values. The column in `df` was then overwritten with `percent_change`, after which it showed up in the `.describe` output.
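The steps described above can be reproduced on a tiny table. This is a sketch under stated assumptions: `demo` is a hypothetical stand-in for the scraped data frame with made-up values, and the final `droplevel` step is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: two-level column names and
# percentages stored as strings. The values are made up.
top = 'Dow Jones Industrial Average - Historical Annual Data'
demo = pd.DataFrame({
    (top, 'Year'): [2001, 2002, 2003],
    (top, 'Annual % Change'): ['-40.5%', '-12.3%', '8.0%'],
})

# Converting the column index to a list prints every full name, no truncation.
print(demo.columns.tolist())

# On strings, min() compares text character by character, not numerically.
print(demo[top, 'Annual % Change'].min())

# Strip the percent sign, convert to float, and write the result back.
cleaned = demo[top, 'Annual % Change'].str.replace('%', '', regex=False).astype(float)
demo[top, 'Annual % Change'] = cleaned

print(cleaned.min(), cleaned.max())
print(demo.describe())  # Annual % Change is now part of the numeric summary

# Optional: drop the shared top level so columns can be selected by short name.
flat = demo.droplevel(0, axis=1)
print(flat['Annual % Change'].mean())
```

Passing `regex=False` makes it explicit that `'%'` is treated as a literal character.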


In [None]:
percent_change.plot(xticks = df['Dow Jones Industrial Average - Historical Annual Data', 'Year'])
plt.ylabel('Annual % Change')
plt.xlabel('Year')
plt.show()

Yikes, this is not what we want.

In [None]:
plt.plot(df['Dow Jones Industrial Average - Historical Annual Data', 'Year'], df['Dow Jones Industrial Average - Historical Annual Data', 'Annual % Change'], color = 'k')
plt.xlabel('Year')
plt.ylabel('Annual % Change')
plt.show()

In [None]:
plt.plot(df['Dow Jones Industrial Average - Historical Annual Data', 'Year'], df['Dow Jones Industrial Average - Historical Annual Data', 'Average Closing Price'])
plt.xlabel('Year')
plt.ylabel('Average Closing Price')
plt.show()

In [None]:
plt.plot(df['Dow Jones Industrial Average - Historical Annual Data', 'Year'], df['Dow Jones Industrial Average - Historical Annual Data', 'Year High'], label='Year High')
plt.plot(df['Dow Jones Industrial Average - Historical Annual Data', 'Year'], df['Dow Jones Industrial Average - Historical Annual Data', 'Year Low'], label= 'Year Low')
plt.legend()
plt.xlabel('Year')
plt.ylabel('Year High / Year Low')
plt.show()

- I believe the next step in analyzing data is to start visualizing it. The first graph gives a good sense of how the market performed in terms of its percentage change each year, whether that change was negative or positive. What is interesting about that graph is how much more volatile the early years were compared to the later ones. This does make sense if a market is harder to move by a large percentage once it is valued higher, but it is cool to see visually.
- The next graph shows how compounding works. The Dow Jones looks flat for a long stretch of time and then spikes, but that is because the market compounds: the same percentage gain is a much larger point gain once the index is high. It shows why people should give their money a long time to build wealth for retirement.
- The last graph plots the year high and the year low together. I thought it was cool to see both of them and the differences between them.
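To illustrate the compounding point, here is a small sketch with a made-up index. The starting value of 100 and the constant 7% yearly growth are assumptions chosen only for illustration and are not Dow Jones figures. On a linear axis the series looks flat and then spikes, while on a log axis the same series is a straight line because the growth rate never changes.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical index: starts at 100 and grows 7% every year for 100 years.
years = np.arange(100)
values = 100 * 1.07 ** years

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Linear y-axis: early growth is barely visible next to the later values.
ax_linear.plot(years, values)
ax_linear.set_xlabel('Year')
ax_linear.set_ylabel('Index value')
ax_linear.set_title('Linear scale')

# Log y-axis: constant percentage growth shows up as a straight line.
ax_log.plot(years, values)
ax_log.set_yscale('log')
ax_log.set_xlabel('Year')
ax_log.set_ylabel('Index value')
ax_log.set_title('Log scale')

plt.tight_layout()
plt.show()
```

The same `set_yscale('log')` call could be tried on the Average Closing Price plot to compare growth across decades.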

In [None]:
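df.to_csv('dow_jones_csv', index=False)  # Skip the row index so it is not saved as an extra column

In [None]:
df_new = pd.read_csv('dow_jones_csv', header=[0, 1])  # The column names span two header rows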

In [None]:
df_new