# Dow Jones - Historic Stock Data (2000-2020) Analysis

In this notebook we take a quick look at the historic Dow Jones data set, engineer a few features, and use plots to understand the results.


In [None]:
# Import libs...
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # basic plotting options
import seaborn as sns # statistical data visualization

# ... and set variables
FILE_LOCATION = '/kaggle/input/dow-stock-data/dow_historic_2000_2020.csv'


In [None]:
# Read dataframe and show last couple of rows...
df = pd.read_csv(FILE_LOCATION)
df.tail()

...as you can see, we have multiple interesting data columns (see the data set description for a detailed explanation of each column): <https://www.kaggle.com/deeplytics/dow-stock-data>

## Pre-Processing

Now, let's engineer some date-related features and also add a column that shows each stock's percentage change within the trading day (open to close).

In [None]:
# Engineer some basic features...
# Convert date to datetime format
df.date = pd.to_datetime(df.date)
# Add year
df['year'] = df.date.dt.year
# Add month
df['month'] = df.date.dt.month_name().str[0:3]
# Add weekday
df['day_of_week'] = df.date.dt.day_name()
# Calculate the intraday stock movement (open to close) in per cent
df['pct_change_day'] = round((df.close / df.open - 1) * 100, 3)
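
Note that `pct_change_day` compares each day's close with the same day's open. A minimal sketch on a made-up two-day frame (the ticker `AAA` and its prices are hypothetical; only the column names `stock`, `date`, `open` and `close` are taken from the data set) shows how this differs from the close-to-close change that pandas' `pct_change` would give:

```python
import pandas as pd

# Hypothetical toy data: one made-up ticker over two trading days
toy = pd.DataFrame({
    "stock": ["AAA", "AAA"],
    "date": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "open": [100.0, 103.0],
    "close": [102.0, 104.04],
})

# Intraday movement (open to close), same formula as in the cell above
toy["pct_change_day"] = ((toy["close"] / toy["open"] - 1) * 100).round(3)

# Close-to-close movement relative to the previous trading day, per stock
toy = toy.sort_values(["stock", "date"])
toy["pct_change_close"] = (toy.groupby("stock")["close"].pct_change() * 100).round(3)

print(toy[["date", "pct_change_day", "pct_change_close"]])
```

The first close-to-close value of each stock is `NaN` because there is no previous close. Price gaps between one day's close and the next day's open show up only in the second measure.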


Now, let's check our data with pandas' `describe` operation...

In [None]:
df.describe()

...interesting: historically, the Dow Jones stocks in this data set gained an average of 0.024 per cent per day from open to close (which is roughly 0.12 per cent per week and approx. 6 per cent per year).
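
The weekly and yearly figures follow from scaling the daily mean. As a rough check, assuming 5 trading days per week and about 252 trading days per year (both are assumptions, not values from the data set), the arithmetic can be sketched as:

```python
# Mean intraday change in per cent, as quoted above from df.describe()
daily_pct = 0.024

# Assumed calendar conventions (not taken from the data set)
TRADING_DAYS_PER_WEEK = 5
TRADING_DAYS_PER_YEAR = 252

# Simple (additive) scaling
weekly_pct = daily_pct * TRADING_DAYS_PER_WEEK
yearly_pct_simple = daily_pct * TRADING_DAYS_PER_YEAR

# Compounded scaling: grow by the daily rate once per trading day
yearly_pct_compounded = ((1 + daily_pct / 100) ** TRADING_DAYS_PER_YEAR - 1) * 100

print(f"weekly: {weekly_pct:.2f} %")
print(f"yearly (simple): {yearly_pct_simple:.2f} %")
print(f"yearly (compounded): {yearly_pct_compounded:.2f} %")
```

Both yearly variants land close to 6 per cent. Because `pct_change_day` measures open to close only, overnight price moves are not part of this estimate.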

## Plotting

We can also do some plotting to learn more about the data. 

First, let us check the average daily movement of each stock.

In [None]:
# Use a group-by operation to calculate the mean pct_change_day per stock
stock_performance = df.groupby('stock').pct_change_day.mean() 

# Choose seaborn style
sns.set_style("darkgrid")
# Use matplotlib and seaborn to draw a line plot...
plt.figure(figsize=(12,5))
chart = sns.lineplot(x=stock_performance.index, y=stock_performance)
_ = plt.xticks(rotation=45)


We can observe that the *CRM* stock (*Salesforce*) has the highest *mean* intra-day stock performance in this data set (note that this company is rather young, and historic performance should not be extrapolated into the future).
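
A young listing contributes fewer trading days, so its mean rests on less data. One way to make that visible is to put the number of observations next to the mean. The sketch below does this on made-up numbers (the tickers `AAA` and `BBB` and their values are hypothetical); the same `groupby` call could be applied to `df`:

```python
import pandas as pd

# Hypothetical toy data: "BBB" has far fewer observations than "AAA"
toy = pd.DataFrame({
    "stock": ["AAA", "AAA", "AAA", "BBB"],
    "pct_change_day": [0.1, -0.2, 0.4, 0.5],
})

# Mean intraday change and number of trading days per stock, best mean first
summary = (
    toy.groupby("stock")["pct_change_day"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)

best_stock = summary.index[0]
print(summary)
print("Highest mean:", best_stock)
```

Here the top-ranked ticker is also the one with the fewest rows, which is a hint to read its mean with some caution.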

Next, let us combine the *pct_change_day* feature with two of the other features we derived from the date column (month and day of week).

In [None]:
# Make lineplot for months...
_ = sns.lineplot(data=df, x="month", y="pct_change_day")

In [None]:
# Make lineplot for day of week...
_ = sns.lineplot(data=df, x="day_of_week", y="pct_change_day")

From the plots above we can see that, *historically*, stocks dropped most in June and September and performed best in March and November. It is also interesting to see that, on *average*, stocks increased on every day of the week except Friday.
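
The values behind both plots can also be tabulated in calendar order. The following is a minimal sketch: the helper `mean_change_by` and the toy rows are hypothetical, and only the column names `month`, `day_of_week` and `pct_change_day` come from the cells above.

```python
import pandas as pd

MONTH_ORDER = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
WEEKDAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]


def mean_change_by(frame, column, order):
    """Return the mean pct_change_day per category, sorted by `order`."""
    means = frame.groupby(column)["pct_change_day"].mean()
    # Reindex to calendar order and drop categories without any rows
    return means.reindex(order).dropna()


# Hypothetical toy rows, deliberately out of calendar order
toy = pd.DataFrame({
    "month": ["Mar", "Jan", "Mar", "Jan"],
    "day_of_week": ["Friday", "Monday", "Monday", "Friday"],
    "pct_change_day": [1.0, -1.0, 3.0, 0.5],
})

by_month = mean_change_by(toy, "month", MONTH_ORDER)
by_weekday = mean_change_by(toy, "day_of_week", WEEKDAY_ORDER)
print(by_month)
print(by_weekday)
```

Calling `mean_change_by(df, "month", MONTH_ORDER)` on the full data frame would list the monthly means in calendar order, which makes it easier to compare them with the plot.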


## What did we learn?
- Which kind of information is included in this data set
- How to easily engineer some additional features
- How to use a plot to compare the performance of the different stocks
- How to plot some of the newly created features (month, day of week) to understand their value

This was only a small example of what can be done with the available data set. Other features like *dividends*, *trading volume* and *stock splits* could be investigated as well. If you like, **have fun** and good luck with your own stock data analysis!