# Stock market analysis demo

This was inspired by the short "Intro to Data Analysis" course I just completed. It was not the best course ever, and after a while I realised I'd probably already gone beyond what it was teaching. The first half of the course was about getting data from APIs (stock market prices in this case), but unfortunately the APIs the instructor was using had been taken down a long time ago and he admitted in the videos that he had no other sources. So actually writing the code he was using was pointless.

At least he provided the CSV files so we could attempt the rest. I stuck around for the section on setting up and running an SQLite database, which I hadn't done before, and skipped through the SQL tutorial to the part about accessing databases through Python. I think that was the one part that was slightly useful.

But then we had to do some actual analysis using the data from the CSV files, and that was painful at best. Mainly because the instructor was walking everyone through some very complicated dictionaries and loops in order to calculate the total value of two stocks at close, by day, for a year. And all I could think was: "This could be done so much faster and with much less code if he were just using pandas and DataFrames".

So this notebook was me setting out to work out how to do that. All I wanted to do was grab the two CSV files, add up the total closing value by date, and then plot it on a graph. So here we go.

In [80]:
import csv, os
#from datetime import datetime
#from dateutil.parser import parse
import numpy as np
import pandas as pd

I commented out the datetime and dateutil.parser imports because they are unnecessary with a pandas DataFrame. The original code from the instructor needed them to turn the CSV dates into real dates, but pandas has a to_datetime() function that is much cleaner and produces the same timestamp.
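To illustrate the point about date handling, here is a minimal sketch on made-up data. The three-row CSV text is a hypothetical stand-in for the real files, whose date format may differ. It shows two routes to the same result: converting a string column with `pd.to_datetime`, or asking `read_csv` to parse the column while loading.

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for goog.csv (rows deliberately out of order).
csv_text = "Date,Close\n2016-06-02,730.40\n2016-05-31,735.72\n2016-06-01,734.15\n"

# Option 1: read the dates as plain strings, then convert the whole column at once.
df = pd.read_csv(io.StringIO(csv_text))
df["Date"] = pd.to_datetime(df["Date"])

# Option 2: let read_csv do the conversion while loading.
df_alt = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Either way the column sorts chronologically and exposes the .dt accessor.
ordered = df.sort_values("Date")["Date"].dt.strftime("%Y-%m-%d").tolist()
print(ordered)
```

Neither route needs datetime or dateutil imported by hand.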

In [81]:
# Build the paths with os.path.join rather than string concatenation.
df1 = pd.read_csv(os.path.join(os.getcwd(), 'goog.csv'))
df2 = pd.read_csv(os.path.join(os.getcwd(), 'f.csv'))

In [82]:
df1['Symbol'] = 'goog'
df2['Symbol'] = 'f'

So far so good. Both CSV files have been loaded into DataFrames. Now to combine them into one and convert the Date column into real dates, rather than the plain strings (pandas' object dtype) that read_csv produces by default, which can't be reliably sorted for the later graph.

In [83]:
tot_df = pd.concat([df1, df2])
tot_df['Date'] = pd.to_datetime(tot_df['Date'])
tot_df.dtypes

Date      datetime64[ns]
Open             float64
High             float64
Low              float64
Close            float64
Volume             int64
Symbol            object
dtype: object
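One side effect of `pd.concat` worth knowing about: each frame keeps its original row labels, so the combined index contains duplicates. The sketch below uses two tiny made-up frames (not the real data) to show this, along with `ignore_index=True` as one way to get a fresh index. It makes no difference to the analysis here, since everything is grouped by date later.

```python
import pandas as pd

# Toy stand-ins for the two per-stock DataFrames (illustrative values only).
a = pd.DataFrame({"Date": ["2016-05-31", "2016-06-01"], "Close": [735.72, 734.15]})
b = pd.DataFrame({"Date": ["2016-05-31", "2016-06-01"], "Close": [13.49, 13.11]})
a["Symbol"] = "goog"
b["Symbol"] = "f"

kept = pd.concat([a, b])                      # row labels carried over from a and b
fresh = pd.concat([a, b], ignore_index=True)  # row labels renumbered from zero

print(kept.index.tolist())   # [0, 1, 0, 1]
print(fresh.index.tolist())  # [0, 1, 2, 3]
```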

I'm sure this could have been done with fewer lines of code by a really great coder, but I don't think six lines of code so far is bad when it took the instructor twice as many lines to achieve the same. Now to add a column that's the close multiplied by the volume, and because it's massive, I'll display it in $M. The instructor never really explained why he was doing this. I suspect it's because he needed to add some calculations, rather than out of any real feeling of necessity.

In [84]:
tot_df['Total Value ($M)'] = round((tot_df['Volume']*tot_df['Close'])/1000000, 2)
#tot_df.head(10)
tot_df.sort_values('Date').head()

| | Date | Open | High | Low | Close | Volume | Symbol | Total Value ($M) |
|---|---|---|---|---|---|---|---|---|
| 250 | 2016-05-31 | 731.74 | 739.73 | 731.26 | 735.72 | 2129545 | goog | 1566.75 |
| 250 | 2016-05-31 | 13.49 | 13.56 | 13.4 | 13.49 | 26096015 | f | 352.04 |
| 249 | 2016-06-01 | 13.43 | 13.44 | 12.97 | 13.11 | 58219591 | f | 763.26 |
| 249 | 2016-06-01 | 734.53 | 737.21 | 730.66 | 734.15 | 1253593 | goog | 920.33 |
| 248 | 2016-06-02 | 732.5 | 733.02 | 724.17 | 730.4 | 1341807 | goog | 980.06 |
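As a quick sanity check on the arithmetic, the sketch below recomputes the new column for the two 2016-05-31 rows shown above. It uses `Series.round` rather than the built-in `round`, which is just an alternative spelling of the same step. The small frame is rebuilt by hand here and is not the real `tot_df`.

```python
import pandas as pd

# The Close and Volume figures for 2016-05-31, copied from the output above.
check = pd.DataFrame({
    "Symbol": ["goog", "f"],
    "Close": [735.72, 13.49],
    "Volume": [2129545, 26096015],
})

# Close x Volume, scaled to millions of dollars and rounded to two decimals.
check["Total Value ($M)"] = (check["Volume"] * check["Close"] / 1_000_000).round(2)

print(check["Total Value ($M)"].tolist())  # [1566.75, 352.04]
```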


So, the DataFrame now has its new column. Now I want to create a new DataFrame, summing up that total by date. In the instructor's code, there was a three-layer deep nesting of for loops, a couple of ifs, and some oddities with parsing floats. I think I can pull it off in one line with a DataFrame.

In [85]:
# .sum() avoids passing the built-in sum to aggregate; reset_index() already returns a DataFrame.
grouped = tot_df.groupby('Date')['Total Value ($M)'].sum().reset_index()
grouped.head(10)

| | Date | Total Value ($M) |
|---|---|---|
| 0 | 2016-05-31 | 1918.79 |
| 1 | 2016-06-01 | 1683.59 |
| 2 | 2016-06-02 | 1529.79 |
| 3 | 2016-06-03 | 1405.42 |
| 4 | 2016-06-06 | 1459.42 |
| 5 | 2016-06-07 | 1328.03 |
| 6 | 2016-06-08 | 1394.99 |
| 7 | 2016-06-09 | 984.03 |
| 8 | 2016-06-10 | 1233.83 |
| 9 | 2016-06-13 | 1173.64 |
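To make the comparison with the loop-based approach concrete, here is a rough side-by-side on four toy rows taken from the first two dates. The loop version is a generic dictionary accumulator written for illustration, not the instructor's actual code, which was more involved.

```python
from collections import defaultdict
import pandas as pd

# Toy rows: (date, symbol, total value in $M) for the first two trading days.
rows = [
    ("2016-05-31", "goog", 1566.75),
    ("2016-05-31", "f", 352.04),
    ("2016-06-01", "f", 763.26),
    ("2016-06-01", "goog", 920.33),
]

# Loop-and-dictionary version: keep a running total per date.
totals = defaultdict(float)
for date, _symbol, value in rows:
    totals[date] += value

# pandas version: group by date and sum the value column.
df = pd.DataFrame(rows, columns=["Date", "Symbol", "Total Value ($M)"])
summed = df.groupby("Date")["Total Value ($M)"].sum().reset_index()

print(dict(totals))
print(summed)
```

Both produce the same per-date totals; the groupby just hides the bookkeeping.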


### The graph

This is the tricky part I've never done before. I've used matplotlib but never pygal, so there had to be a way. The promise of some new visualization tricks is what kept me going with this course even as I bashed my head on the desk at the instructor's weird dictionary tricks to avoid pandas. I knew it wasn't going to be straightforward, because I'm working in a notebook instead of writing a script, but a bit of Googling and I had my answer for how to embed pygal graphs in a notebook. First, I needed some set-up lines to import the necessary libraries.

In [86]:
%matplotlib inline
from IPython.display import SVG, HTML, display
html_pygal = """
<!DOCTYPE html>
<html>
  <head>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/svg.jquery.js"></script>
  <script type="text/javascript" src="http://kozea.github.com/pygal.js/javascripts/pygal-tooltips.js"></script>
    <!-- ... -->
  </head>
  <body>
    <figure>
      {pygal_render}
    </figure>
  </body>
</html>
"""
import pygal

And now I'm ready to pull together a pretty line graph of the closing value of this tiny, two-stock market over a year. First I need to create something to hold the dates for the x-axis.

In [87]:
dates = tuple(grouped['Date'])
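A tuple built this way holds pandas Timestamp objects, whose string form includes a time component (`2016-05-31 00:00:00`). If shorter axis labels are wanted, one option is to format the dates as strings first. This is a sketch on a handful of sample dates, and the `labels` and `major` names are just illustrative; it also shows integer division as a tidy way to pick the middle label.

```python
import pandas as pd

# A few sample trading days standing in for grouped['Date'].
sample_dates = pd.Series(pd.to_datetime(
    ["2016-05-31", "2016-06-01", "2016-06-02", "2016-06-03", "2016-06-06"]
))

# Format each Timestamp as a plain date string.
labels = tuple(sample_dates.dt.strftime("%Y-%m-%d"))

# First, middle and last labels, using // so no int() conversion is needed.
major = [labels[0], labels[len(labels) // 2], labels[-1]]

print(str(sample_dates.iloc[0]))  # 2016-05-31 00:00:00
print(major)                      # ['2016-05-31', '2016-06-02', '2016-06-06']
```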

Now I need to build the chart, setting a title, adding labels to the x-axis, and adding one line from my grouped DataFrame. Finally, I'll save it to a file (in case I want it later) and display it in this notebook because it would be a shame not to.

In [88]:
from pygal.style import TurquoiseStyle

line_chart = pygal.Line(x_label_rotation=20, show_minor_x_labels=False, 
                        x_title='31 May 2016 to 26 May 2017', style=TurquoiseStyle)
line_chart.title = 'A two stock market value over time'
line_chart.x_labels = dates
line_chart.x_labels_major = [dates[0], dates[len(dates) // 2], dates[-1]]
line_chart.add('Value ($M)', grouped['Total Value ($M)'])

#Save the chart to a file and then render the chart inline.
line_chart.render_to_file('line_chart.svg') 
#display({'image/svg+xml': line_chart.render()}, raw=True)
HTML(html_pygal.format(pygal_render=line_chart.render(is_unicode=True)))

## Conclusion

And there we go: the analysis the instructor in that awful course did, all in a lovely notebook and achieved with around half the code, thanks to pandas.

It probably took me longer than copying his code would have taken, because there was a lot of Googling to poke my rusty brain into remembering some of the DataFrame tricks and I had to work out how to display the charts in the notebook, but I'm happy with the result. The pygal charts are much nicer than the matplotlib charts I've used in the past and I particularly like how customizable they are in terms of styles.