# Modern Data Science 
**(Module 02: Data Visualization)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session A - Exploratory Data Analysis


This practical session will show you how to use packages for data exploration.


## Content

### Part 1 Matplotlib Module


### Part 2 Plotting a Histogram

2.1 [Dataset](#ds)

2.2 [Histogram](#hist)

2.3 [Boxplot](#boxplot)


### Part 3: Data Understanding

3.1 [Pie Chart](#pie)

3.2 [Bar Chart](#bar)

3.3 [Word Cloud](#wordcloud)

3.4 [Step Plot](#stepplot)

3.5 [Histogram](#histogram)

3.6 [Box Plot](#box)

3.7 [Scatter Plot](#scatter)

### Part 4: Exercise



## <span style="color:#0b486b">1. Matplotlib</span>


matplotlib is a Python plotting library that produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. You can generate plots, histograms, power spectra, bar charts, error charts, scatter plots, etc., with just a few lines of code.

For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. You have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
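As a minimal sketch of the two interfaces described above, the same sine curve can be drawn once with pyplot functions and once with methods on explicit figure and axes objects. The data and titles are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10)
y = np.sin(x)

# pyplot (MATLAB-like) style: functions act on the "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title('pyplot style')

# object-oriented style: call methods on explicit Figure and Axes objects
fig, ax = plt.subplots()
line, = ax.plot(x, y)
ax.set_title('object-oriented style')
```

The rest of this session mostly uses the object-oriented style, because it stays unambiguous once a figure contains more than one set of axes.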

### <span style="color:#0b486b">1.1 Get started</span>


To get started with `'matplotlib'` you can either execute:

In [None]:
from pylab import *

or

In [None]:
import matplotlib.pyplot

In fact it is a convention to import it under the name of `'plt'`:

In [None]:
import matplotlib.pyplot as plt

**Note: importing `matplotlib.pyplot` (the second method) is preferred over `from pylab import *`.**

In [None]:
import matplotlib.pyplot as plt
import numpy as np

Regardless of the method you use, it is better to configure matplotlib to embed figures in the notebook instead of opening them in a new window for each figure. To do this use the magic function:

In [None]:
%matplotlib inline

### <span style="color:#0b486b">1.2 `plot`</span>


By using `'subplots()'` you have access to both figure and axes objects. 

In [None]:
x = np.linspace(0, 10)
y = np.sin(x)

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y)

### <span style="color:#0b486b">1.3 title and labels</span>


In [None]:
ax.set_title('title here!')
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
fig

You can also use $\LaTeX$ in title or labels, or change the font size or font family.

In [None]:
x = np.linspace(-10, 10)
fig, ax = plt.subplots(figsize=(8, 5), dpi=300)
ax.plot(x, x**3-x**2, 'bo', lw=3)

ax.set_title('$x^3-x^2$', fontsize=18)
ax.set_xlabel('$x$', fontsize=18)
ax.set_ylabel('$y$', fontsize=18)

### <span style="color:#0b486b">1.4 Subplots</span>

You can pass the number of subplots to `'subplots()'`. In this case, `'axes'` will be an array that each of its elements associates with one of the subgraphs. You can set properties of each `'ax'` object separately like the cell below. 

You could also use a loop to iterate over `'axes'`.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)

x = np.linspace(0, 10)

axes[0].plot(x, np.sin(x))
axes[0].set_xlabel('x')
axes[0].set_ylabel('sin(x)')

axes[1].plot(x, np.cos(x))
axes[1].set_xlabel('xx')
axes[1].set_ylabel('cos(x)')

The `'cos(x)'` label overlaps the `'sin'` graph. You can adjust the size of the figure or the space between the subplots to fix it.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
fig.subplots_adjust(wspace=0.4)
x = np.linspace(0, 10)

axes[0].plot(x, np.sin(x))
axes[0].set_xlabel('x')
axes[0].set_ylabel('sin(x)')

axes[1].plot(x, np.cos(x))
axes[1].set_xlabel('xx')
axes[1].set_ylabel('cos(x)')
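The loop over `'axes'` mentioned earlier could be sketched as follows. Pairing each function with its label in a list named `curves` is just one possible arrangement, not the only way to do it.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10)
# pair each function with the label for its y-axis
curves = [(np.sin, 'sin(x)'), (np.cos, 'cos(x)')]

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
fig.subplots_adjust(wspace=0.4)

# axes is a 1-D array here, so it can be zipped with the list of curves
for ax, (func, label) in zip(axes, curves):
    ax.plot(x, func(x))
    ax.set_xlabel('x')
    ax.set_ylabel(label)
```

With more than one row and more than one column, `axes` becomes a 2-D array; iterating over `axes.flat` then visits every subplot in turn.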

### <span style="color:#0b486b">1.5 Legend</span>


In [None]:
x = np.linspace(0, 10)
fig, ax = plt.subplots(figsize=(7, 5))
ax.plot(x, np.sin(x), label='$sin(x)$')
ax.plot(x, np.cos(x), label='$cos(x)$')
ax.legend(fontsize=16, loc=3)  # loc=3 places the legend in the lower left corner


### <span style="color:#0b486b">1.6 Customizing ticks</span>


In many cases you will want to customize the ticks and their labels on the x or y axis. First draw a simple graph and look at the ticks on the x-axis.

In [None]:
x = np.linspace(0, 10, num=100)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(x, np.sin(x), x, np.cos(x), lw=2)

You can change the ticks easily by passing a list (or array) to `'set_xticks()'` or `'set_yticks()'`:

In [None]:
xticks = [0, 1, 2, 5, 8, 8.5, 10]
ax.set_xticks(xticks)
fig

You can even change the tick labels:

In [None]:
# raw strings keep the LaTeX backslashes intact; one label is needed per tick (7 here)
xticklabels = [r'$\gamma$', r'$\delta$', 'apple', 'b', '', 'c', '']
ax.set_xticklabels(xticklabels, fontsize=18)
fig

### <span style="color:#0b486b">1.7 Saving figures</span>


In [None]:
x = np.linspace(0, 10)
fig, ax = plt.subplots(figsize=(7, 5))
ax.plot(x, np.sin(x), label='$sin(x)$')
ax.plot(x, np.cos(x), label='$cos(x)$')
ax.legend(fontsize=16, loc=3)
fig.savefig('P03Saved.pdf', format='PDF', dpi=300)

### <span style="color:#0b486b">1.8 Other plot styles</span>

There are many other plot types in addition to simple `'plot'` supported by `'matplotlib'`. You will find a complete list of them on [matplotlib gallery](http://matplotlib.org/gallery.html).

#### <span style="color:#0b486b">1.8.1 Scatter plot</span>



In [None]:
fig, ax = plt.subplots()

x = np.linspace(-0.75, 1., 100)
ax.scatter(x, np.random.randn(x.shape[0]), 
                   s = 250*np.abs(np.random.randn(x.shape[0])), 
                   alpha=0.4,
                   edgecolor='none')
ax.set_title('scatter')

#### <span style="color:#0b486b">1.8.2 Bar plot</span>


In [None]:
fig, ax = plt.subplots()

x = np.arange(1, 6)
ax.bar(x, x**2, align="center")
ax.set_title('bar')

---
## <span style="color:#0b486b">2. Plotting a histogram</span>


<a id = "ds"></a>


### <span style="color:#0b486b">2.1 Dataset</span>


You are provided with a dataset of percentage of body fat and 10 simple body measurements recorded for 252 men (courtesy of the Journal of Statistics Education, JSE). You can read about this and other [JSE datasets here](http://www.amstat.org/publications/jse/jse_data_archive.htm).

First load the data set into an array:

In [None]:
import numpy as np

In [None]:
import wget

link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/fat.dat.txt'
DataSet = wget.download(link_to_data)

In [None]:
data = np.genfromtxt("fat.dat.txt")
data.shape

Based on the [dataset description](http://www.amstat.org/publications/jse/datasets/fat.txt), the weight in lbs is stored in the sixth column, which is index 5 with zero-based indexing. Index the weight column and call it `'weights'`:

In [None]:
weights = data[:, 5]
weights

Use array operators to convert the weights into kg. 1 lb equals 0.453592 kg.

In [None]:
a = 3
a = a + 4   # ordinary assignment: a is now 7
a += 4      # in-place operator, same effect as a = a + 4: a is now 11

In [None]:
weights *= 0.453592
weights = weights.round(2)
weights
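One detail worth knowing about the in-place conversion: basic slicing of a NumPy array returns a view, so an in-place operator on the slice also changes the array it was taken from. The sketch below uses a small made-up array in place of the real dataset.

```python
import numpy as np

# made-up stand-in for the dataset: column 1 holds weights in lb
data = np.array([[1.0, 150.0],
                 [2.0, 200.0]])

view = data[:, 1]            # basic slicing returns a view, not a copy
view *= 0.453592             # in-place multiply writes through to `data`
shares_memory = np.shares_memory(view, data)

copy_kg = data[:, 1].copy()  # .copy() gives an independent array
copy_kg *= 2                 # `data` is unaffected by this
```

If the original values in lb are still needed later, taking a `.copy()` of the column before converting is the safer choice.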

<a id = "hist"></a>

### <span style="color:#0b486b">2.2 Histogram</span>


A histogram is a bar plot that shows the statistical distribution of the data over a variable. The bars represent the frequency of occurrence by classes of data. We use the package `'matplotlib'` and the function `'hist()'` for plotting the histogram. To learn more about `'matplotlib'`, make sure you have read the tutorial in Part 1.

The first line of the cell below is for showing the figure in the notebook instead of opening it in a separate window.

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
fig, ax = plt.subplots()
ax.hist(weights)

By default, the `'hist()'` function groups the data into 10 bins. Usually you need to tweak the number of bins to obtain a more expressive histogram.

In [None]:
fig, ax = plt.subplots(figsize=(7, 5))
ax.hist(weights, bins=20)
ax.set_title('Weight distribution')
ax.set_xlabel('weight (kg)')
ax.set_ylabel('frequency')
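To see what changing the number of bins does to the counts, `np.histogram` can be called directly; it computes the same kind of binning without drawing anything. The sample below is synthetic (random values around 80 kg), not the body fat dataset.

```python
import numpy as np

# synthetic weights in kg, for illustration only
rng = np.random.default_rng(0)
sample = rng.normal(loc=80, scale=12, size=252)

counts10, edges10 = np.histogram(sample, bins=10)
counts20, edges20 = np.histogram(sample, bins=20)

# n bins are delimited by n + 1 edges, and every value falls in exactly one bin
print(len(counts10), len(edges10), counts10.sum())
print(len(counts20), len(edges20), counts20.sum())
```

With more bins each bar covers a narrower range, so the shape shows more detail but individual bar heights become noisier.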

<a id = "boxplot"></a>

### <span style="color:#0b486b">2.3 Boxplot</span>

A `Boxplot` is a convenient way to graphically display numerical data. 

In [None]:
fig, ax = plt.subplots(figsize=(7, 5))
plt.rcParams.update({'font.size': 14})
ax.boxplot(weights, notch=False)  # notch=False draws a plain rectangular box
ax.set_xticklabels(['group1'])
ax.set_ylabel('weight (kg)', fontsize=16)
ax.set_title('Weights BoxPlot', fontsize=16)
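The numbers behind a boxplot can be computed by hand, which helps when reading one. This sketch assumes the default whisker rule of 1.5 times the interquartile range and uses a small made-up sample rather than the weights above.

```python
import numpy as np

# made-up sample with one unusually large value
sample = np.array([60, 65, 70, 72, 75, 78, 80, 85, 90, 140], dtype=float)

q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # height of the box

# default whisker rule: points beyond 1.5 * IQR from the box are drawn as fliers
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = sample[(sample < low_fence) | (sample > high_fence)]
```

Here the box spans 70.5 to 83.75 with the median line at 76.5, and the value 140 lies above the upper fence, so it would be drawn as an individual point.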


---

## 3. Data Understanding 


You have already been taught about different sorts of plots, how they help to get a better understanding of the data, and when to use which. In this practical session we will work with the `matplotlib` package to learn more about plotting in Python.

In [None]:
import numpy as np
import csv

import matplotlib.pyplot as plt
%matplotlib inline

<a id = "pie"></a>

### <span style="color:#0b486b">3.1 Pie Chart</span>

Suppose you have the frequency count of a variable (e.g. hair_colour). Draw a pie chart to explain it.

In [None]:
labels = 'Black', 'Red', 'Brown'

# frequency count
hair_colour_freq = [5, 3, 2]  # Black, Red, Brown

# colors
colors = ['yellowgreen', 'gold', 'lightskyblue']

# explode the third one
explode = (0, 0, 0.1)

fig, ax = plt.subplots(figsize=(5, 5))
ax.pie(hair_colour_freq, labels=labels, explode=explode, colors=colors, 
       autopct='%1.1f', shadow=True, startangle=90);

What if we have too many tags and sectors?

In [None]:
# Excellence in Research Australia
labels = ['HEALTH', 'ENGINEERING', 'COMPUTER SCIENCES', 'HUMAN SOCIETY', 
          'TOURISM SERVICES', 'EDUCATION', 'CHEMISTRY', 'BIOLOGY', 'PSYCHOLOGY', 
          'CREATIVE ARTS', 'LINGUISTICS', 'BUILT ENVIRONMENT', 'HISTORY', 
          'ECONOMICS', 'PHILOSOPHY', 'AGRICULTURE', 'ENVIRONMENT', 'TECHNOLOGY', 
          'LAW', 'MATHS', 'EARTH SCIENCES', 'PHYSICS']


# frequency count
xx = [2625.179999, 1306.259999, 1187.039999, 1166.04, 980.8599997, 810.5999998,
      725.6399996, 678.7899998, 436.5999997, 404.3299999, 348.01, 304.33, 294.19, 
      293.02, 282.31, 228.21, 197.3399999, 164.0599998, 157, 50.49999998, 49.60999999, 48.08000005]

fig, ax = plt.subplots(figsize=(10, 10))
ax.pie(xx, labels=labels, autopct="%1.1f");
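One common way to keep such a pie chart readable is to merge the smallest sectors into a single slice. The sketch below uses a shortened, made-up list of scores and a hypothetical 5% cut-off; both are assumptions for illustration.

```python
import numpy as np

# a shortened, made-up version of the label and score lists
labels = ['HEALTH', 'ENGINEERING', 'LAW', 'MATHS', 'PHYSICS']
scores = np.array([2600.0, 1300.0, 160.0, 50.0, 48.0])

threshold = 0.05  # hypothetical cut-off: sectors under 5% are merged
share = scores / scores.sum()
keep = share >= threshold

# keep the large sectors and add one combined sector for the rest
grouped_labels = [lab for lab, k in zip(labels, keep) if k] + ['OTHER']
grouped_scores = np.append(scores[keep], scores[~keep].sum())
```

The grouped lists can then be passed to `ax.pie(grouped_scores, labels=grouped_labels)` in the same way as before.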

<a id = "bar"></a>

### <span style="color:#0b486b">3.2 Bar Chart</span>

Use the hair colour data to draw a bar chart.

In [None]:
labels = ['Black', 'Red', 'Brown']
hair_colour_freq = [5, 3, 2]

fig, ax = plt.subplots(figsize=(7, 5), dpi=300)

x_pos = np.arange(len(hair_colour_freq))
colors = ['black', 'red', 'brown']

ax.bar(x_pos, hair_colour_freq, align='center', color=colors)

ax.set_xlabel("Hair Colour")
ax.set_ylabel("Number of participants")
ax.set_title("Hair Colour Distribution")

ax.set_xticks(x_pos)
ax.set_xticklabels(labels)

Now suppose we have the hair colour distribution across genders. Plot a grouped bar chart to show the distribution of colours across genders.

In [None]:
"""
        black  red  brown
Male      4     1     3
Female    1     2     2

"""

data = np.array([[4, 1, 3], 
                 [1, 2, 2]])

x_pos = np.arange(2)
width = 0.2
fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.bar(x_pos, data[:, 0], width=width, color='black', label='Black', align='center')
ax.bar(x_pos+width, data[:, 1], width=width, color='red', label='Red', align='center')
ax.bar(x_pos+2*width, data[:, 2], width=width, color='brown', label='Brown', align='center')

ax.legend()

ax.set_xlabel("Gender")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of hair colour amongst genders")

ax.set_xticks(x_pos+width)
ax.set_xticklabels(['Male', 'Female'])

Can we plot it more intelligently? We are doing the same thing multiple times! Is it a good idea to use a loop?

In [None]:
"""
        black  red  brown
Male      4     1     3
Female    1     2     2

"""

data = np.array([[4, 1, 3], 
                 [1, 2, 2]])

n_groups, n_colours = data.shape

x_pos = np.arange(n_groups)
width = 0.2
fig, ax = plt.subplots(figsize=(7, 5), dpi=300)

colours = ['black', 'red', 'brown']
labels = ['Black', 'Red', 'Brown']
for i in range(n_colours):
    ax.bar(x_pos + i*width, data[:, i], width=width, color=colours[i], label=labels[i], align='center')
    
ax.legend()

ax.set_xlabel("Gender")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of hair colour amongst genders")

ax.set_xticks(x_pos+width)
ax.set_xticklabels(['Male', 'Female'])

What if we want to group the bar charts based on the hair colour?

In [None]:
"""
        black  red  brown
Male      4     1     3
Female    1     2     2

"""

group_labels = ['Male', 'Female']
colour_labels = ['Black', 'Red', 'Brown']
colours = ['r', 'y']
data = np.array([[4, 1, 3], 
                 [1, 2, 2]])

n_groups, n_colours = data.shape
width = 0.2
x_pos = np.arange(n_colours)

fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
for i in range(n_groups):
    ax.bar(x_pos + i*width, data[i, :], width, align='center', label=group_labels[i], color=colours[i])
ax.set_xlabel("Hair Colour")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of gender amongst hair colours")

# one tick per hair colour, centred between the two bars of each group
ax.set_xticks(x_pos+width/2)
ax.set_xticklabels(colour_labels)

ax.legend()

#### Stacked bar chart

Another type of bar chart is the stacked bar chart. Draw a stacked bar plot of the hair colour data grouped by hair colour.

In [None]:
"""
        black  red  brown
Male      4     1     3
Female    1     2     2

"""

labels = ['Black', 'Red', 'Brown']
data = np.array([[4, 1, 3], 
                 [1, 2, 2]])

male_freq = data[0,:]

width = 0.4
x_pos = np.arange(data.shape[1])  # one position per hair colour

fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.bar(x_pos, data[0, :], width, align='center', label='Male', color='r')
ax.bar(x_pos, data[1, :], width, bottom=male_freq, align='center', label='Female', color='y')

ax.set_xlabel("Hair Colour")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of gender amongst hair colours")

ax.set_xticks(x_pos)
ax.set_xticklabels(labels)

ax.legend(loc=0)

Draw a stacked bar plot grouped by gender.

In [None]:
"""
        black  red  brown
Male      4     1     3
Female    1     2     2

"""

labels = ['Black', 'Red', 'Brown']
data = np.array([[4, 1, 3], 
                 [1, 2, 2]])

black = data[:,0]
red = data[:,1]
brown = data[:,2]


x_pos = np.arange(2)
width = 0.4
fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.bar(x_pos, data[:, 0], width=width, color='black', label='Black', align='center')
ax.bar(x_pos, data[:, 1], width=width, bottom=black, color='red', label='Red', align='center')
ax.bar(x_pos, data[:, 2], width=width, color='brown', bottom=black+red, label='Brown', align='center')

ax.legend(loc=0)

ax.set_xlabel("Gender")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of hair colour amongst genders")

ax.set_xticks(x_pos)
ax.set_xticklabels(['Male', 'Female'])

In [None]:
labels = ['Black', 'Red', 'Brown']
male_freq = [4, 1, 3]
female_freq = [1, 2, 2]

x_pos = np.arange(3)

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(14, 5), dpi=300)

w1 = 0.2  # bar width for the grouped chart
w2 = 0.4  # bar width for the stacked chart

# left: grouped bars, with the female bars shifted right by one bar width
ax[0].bar(x_pos, male_freq, width=w1, align='center', label='Male', color='r')
ax[0].bar(x_pos+w1, female_freq, width=w1, align='center', label='Female', color='y')
# right: stacked bars, with the female bars starting where the male bars end
ax[1].bar(x_pos, male_freq, width=w2, align='center', label='Male', color='r')
ax[1].bar(x_pos, female_freq, width=w2, bottom=male_freq, align='center', label='Female', color='y')


ax[0].set_xlabel("Hair Colour")
ax[0].set_ylabel("Frequency")
ax[0].set_title("Distribution of gender amongst hair colours")
ax[1].set_xlabel("Hair Colour")
ax[1].set_ylabel("Frequency")
ax[1].set_title("Distribution of gender amongst hair colours")

ax[0].set_xticks(x_pos+w1/2)
ax[0].set_xticklabels(labels)
ax[1].set_xticks(x_pos)
ax[1].set_xticklabels(labels)

ax[0].legend()
ax[1].legend(loc=0)

What if we have too many groups? Draw a bar chart for the Excellence in Research Australia data. 

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Excellence in Research Australia
labels = ['HEALTH', 'ENGINEERING', 'COMPUTER SCIENCES', 'HUMAN SOCIETY', 
          'TOURISM SERVICES', 'EDUCATION', 'CHEMISTRY', 'BIOLOGY', 'PSYCHOLOGY', 
          'CREATIVE ARTS', 'LINGUISTICS', 'BUILT ENVIRONMENT', 'HISTORY', 
          'ECONOMICS', 'PHILOSOPHY', 'AGRICULTURE', 'ENVIRONMENT', 'TECHNOLOGY', 
          'LAW', 'MATHS', 'EARTH SCIENCES', 'PHYSICS']


# frequency count
xx = [2625.179999, 1306.259999, 1187.039999, 1166.04, 980.8599997, 810.5999998,
      725.6399996, 678.7899998, 436.5999997, 404.3299999, 348.01, 304.33, 294.19, 
      293.02, 282.31, 228.21, 197.3399999, 164.0599998, 157, 50.49999998, 49.60999999, 48.08000005]

xx_pos = np.arange(len(xx))

fig, ax = plt.subplots(figsize=(15, 5))
ax.bar(xx_pos, xx, align='center')
ax.set_xlabel("research subject")
ax.set_ylabel("score")
ax.set_xticks(xx_pos)
ax.set_xticklabels(labels, rotation=90)
ax.set_xlim(-1, len(xx))

<a id = "wordcloud"></a>

### <span style="color:#0b486b">3.3 Wordcloud</span>

As you saw, a pie chart is not very helpful when there are too many sectors: it is hard to read and visually cluttered. A word cloud can be used instead. A useful tool is [wordle.net](http://wordle.net). Go to [wordle.net](http://wordle.net) and use it to create a word cloud for the previous data.

In [None]:
for i in range(len(labels)):
    print("{}:{}".format(labels[i], xx[i]))

In [None]:
!pip install wordcloud

In [None]:
import wget

link_to_data = 'https://github.com/benbrandt/cs50/raw/master/pset5/keys/constitution.txt'
DataSet = wget.download(link_to_data)

In [None]:
from wordcloud import WordCloud

# Read the whole text.
with open('constitution.txt') as fp:
    text = fp.read()

# Generate a word cloud image
wordcloud = WordCloud().generate(text)

# Display the generated image:
# the matplotlib way:
import matplotlib.pyplot as plt
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

# lower max_font_size
wordcloud = WordCloud(max_font_size=40).generate(text)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()

# The pil way (if you don't have matplotlib)
# image = wordcloud.to_image()
# image.show()

<a id = "stepplot"></a>

### <span style="color:#0b486b">3.4 Step plot</span>

Draw a step plot for the seatbelt data.

In [None]:
freq = np.array([0, 2, 1, 5, 7])
labels = ['Never', 'Rarely', 'Sometimes', 'Most-times', 'Always']
freq_cumsum = np.cumsum(freq)
x_pos = np.arange(len(freq))

fig, ax = plt.subplots()
ax.step(x_pos, freq_cumsum, where='mid')
ax.set_xlabel("Fastening seatbelt behaviour")
ax.set_ylabel("Cumulative frequency")
ax.set_xticks(x_pos)
ax.set_xticklabels(labels)

<a id = "histogram"></a>

### <span style="color:#0b486b">3.5 Histogram</span>

Google for this paper:

``Johnson, Roger W. "Fitting percentage of body fat to simple body measurements." Journal of Statistics Education 4.1 (1996): 265-266.``

Download the dataset and read the dataset description. Draw a histogram of male weights and female weights.

In [None]:
import wget

link_to_data = 'https://ww2.amstat.org/publications/jse/datasets/body.dat.txt'
DataSet = wget.download(link_to_data)

In [None]:
data = np.genfromtxt('body.dat.txt')
m_w = data[data[:, -1] == 1][:, -3]
f_w = data[data[:, -1] == 0][:, -3]

fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.hist(m_w, bins=15, alpha=0.6, label='male')
ax.hist(f_w, bins=15, alpha=0.6, label='female')
ax.set_xlabel("weight (kg)")
ax.set_title("weight distribution amongst genders")
ax.legend()

<a id = "box"></a>

### <span style="color:#0b486b">3.6 Boxplot</span>
Draw a box plot for male and female weights of the previous dataset.

In [None]:
fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.boxplot([m_w, f_w], labels=['male', 'female'])
ax.set_title("weight distribution amongst genders")

<a id = "scatter"></a>

### <span style="color:#0b486b">3.7 Scatter plot</span>

Draw a scatter plot of the car weights and their fuel consumption as displayed in the lecture.

In [None]:
import wget

link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/Auto.csv'
DataSet = wget.download(link_to_data)

In [None]:
datafile = 'Auto.csv'
data = []
with open(datafile, 'r') as fp:
    reader = csv.reader(fp, delimiter=',')
    for row in reader:
        data.append(row)
# skip the header row and convert the text fields to numbers
miles = [float(dd[1]) for dd in data[1:]]
weights = [float(dd[5]) for dd in data[1:]]

In [None]:
fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.scatter(weights,miles, alpha=0.6, edgecolor='none', s=200)
ax.set_xlabel('Car Weight (tons)')
ax.set_ylabel('Miles Per Gallon')

Can we also show the number of cylinders on this graph? In other words, can a scatter plot show three variables?

In [None]:
cylinder = 75 * np.array([int(dd[2]) for dd in data[1:]])

In [None]:
fig, ax = plt.subplots(figsize=(7, 5), dpi=300)
ax.scatter(weights,miles, alpha=0.6, edgecolor='none', s=cylinder)
ax.set_xlabel('Car Weight (tons)')
ax.set_ylabel('Miles Per Gallon')
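Marker size is one way to encode a third variable; marker colour is another. The sketch below uses a handful of made-up values standing in for the car data, and the `viridis` colour map is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up values standing in for the car data
car_weight = np.array([2100, 2600, 3200, 3900, 4400])
mpg = np.array([34, 28, 22, 17, 13])
cylinders = np.array([4, 4, 6, 8, 8])

fig, ax = plt.subplots(figsize=(7, 5))
# c= maps each value to a colour, cmap= selects the colour scale
points = ax.scatter(car_weight, mpg, c=cylinders, cmap='viridis', s=150)
fig.colorbar(points, ax=ax, label='Cylinders')
ax.set_xlabel('Car Weight')
ax.set_ylabel('Miles Per Gallon')
```

Colour tends to be easier to compare than area when markers overlap, and the colour bar gives the reader an explicit scale.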

---
## 4. Mini Exercise


In 1970, US Congress instituted a random selection process for the military draft. All 366 possible birth dates were placed in plastic capsules in a rotating drum and were selected one by one. The first date drawn from the drum received draft number one and eligible men born on that date were drafted first. The data is provided in a text file with a structure like:

```
Day    Month    MO.NUMBER    DAY_OF_YEAR    DRAFT_NO.
1      JAN      1            1              305
2      JAN      1            2              159
.
31     JAN      1            31             221
1      FEB      2            32             86
.
31     Dec      12           366            100
```


Using what you have learnt so far, can you tell whether it was a fair lottery or not?

Read the data file and save the values in a 2D array.

In [None]:
import wget

link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/DraftLottery.txt'
DataSet = wget.download(link_to_data)

In [None]:
data = []
with open('DraftLottery.txt', 'r') as fp:
    reader = csv.reader(fp, delimiter='\t')
    for row in reader:
        data.append(row)
        
birthdays = np.array([int(row[3]) for row in data[1:]])
draft_no = np.array([int(row[4]) for row in data[1:]])
months = np.array([int(row[2]) for row in data[1:]])

Plot a `'scatter plot'` of the draft priority vs birthdays.

In [None]:
fig, ax = plt.subplots(figsize=(10, 7), dpi=300)
ax.scatter(birthdays, draft_no, alpha=0.7, s = 150, edgecolor='none')
ax.set_xlabel("Birthday (day of the year)", fontsize=12)
ax.set_ylabel("Draft priority value", fontsize=12)
ax.set_title("USA Draft Lottery Data", fontsize=14)

In a truly random lottery there should be no relationship between the date and the draft number. To investigate this further we draw boxplots by months and compare them together.

In [None]:
fig, ax = plt.subplots(figsize=(10, 7), dpi=300)
months_range = range(1, 13)

# boxplot data
boxplot_data = [draft_no[months == mm] for mm in months_range]
ax.boxplot(boxplot_data)

# medians
medians = [np.median(dd) for dd in boxplot_data]
ax.plot(months_range, medians, "g--", lw=2)

# means
means = [dd.mean() for dd in boxplot_data]
ax.plot(months_range, means, "k--", lw=2)

month_labels = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
ax.set_xlabel("Month", fontsize=12)
ax.set_xticklabels(month_labels)
ax.set_ylabel("Draft priority value", fontsize=12)
ax.set_title("USA Draft Lottery Data", fontsize=14)

The trend, in which birthdays later in the year tend to receive lower draft numbers, is hard to see in a scatter plot of draft number vs. birth date, but a series of side-by-side boxplots by month illustrates it clearly. A further investigation of the lottery revealed that the birth dates were placed in the drum by month and were not thoroughly mixed.
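A numeric check can complement the plots: the correlation between day of year and draft number should be close to zero for a well-mixed lottery. The sketch below uses synthetic data only, a random permutation for a fair draw and a deliberately biased toy draw. The noise level of 200 is an arbitrary assumption, and none of the numbers come from the real file.

```python
import numpy as np

rng = np.random.default_rng(42)
days = np.arange(1, 367)

# a well-mixed (fair) lottery: draft numbers are a random permutation
fair_numbers = rng.permutation(days)
r_fair = np.corrcoef(days, fair_numbers)[0, 1]

# a deliberately biased toy lottery: later days drift towards lower numbers
noisy_rank = -days + rng.normal(scale=200, size=days.size)
biased_numbers = noisy_rank.argsort().argsort() + 1  # ranks 1..366
r_biased = np.corrcoef(days, biased_numbers)[0, 1]

print(round(r_fair, 3), round(r_biased, 3))
```

Applying `np.corrcoef(birthdays, draft_no)` to the arrays loaded earlier gives the corresponding value for the real lottery, which can be compared against the near-zero value expected from a fair draw.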