# (Treehouse) Data Science Basics

This course will follow the basic procedures of conducting data science work, namely selecting and describing data, and munging it into a communicable form. At the end of this course, students will be able to pick a small dataset available online and, using Python language, quickly calculate descriptive statistics and show their results with basic charts and tables.

What you'll learn:
* What is data science?
* Loading raw data
* Cleaning data
* MatPlotLib
* NumPy
* Creating reports

>[(Treehouse) Data Science Basics](#scrollTo=V0xHYnSedbrk)

>>[Getting Started with Data Science](#scrollTo=aFul6QsCd3fs)

>>>[What is data science?](#scrollTo=rR1hgCIXgEe4)

>>>[Selecting data](#scrollTo=gvfEnJZIrdFQ)

>>>[Obtaining data](#scrollTo=9l_d6dTHrdct)

>>>[Installing libraries](#scrollTo=Lawq07y7reYu)

>>[Describing Data](#scrollTo=d11IW33ZeGlj)

>>>[Loading raw data](#scrollTo=vMDGVClAsGQo)

>>>[Calculating descriptive statistics](#scrollTo=SDWSDfr8tEJR)

>>>[Calculating sums and totals](#scrollTo=H76Rn1Bn9LGR)

>>>[Calculating averages](#scrollTo=ChPvQd0r9O48)

>>>[Max and min](#scrollTo=rDMJ8VIj9UvD)

>>[Cleaning Data](#scrollTo=l5IRrzW1eX4u)

>>>[Cleaning up your data](#scrollTo=ZKYXo7f59dP4)

>>>[Filering rows](#scrollTo=N2rjmhQv9ffa)

>>>[Grouping rows](#scrollTo=cv_Ii-B49ht_)

>>[Exporting](#scrollTo=zEkiFifkflKW)

>>>[Exporting CSV files](#scrollTo=KK3IkFAb9loI)

>>>[Writing functions to export csv files](#scrollTo=xQIg7Gli9n07)

>>>[Exporting to Excel](#scrollTo=wHPplrnF9qW_)

>>>[Challenge: Exporting](#scrollTo=9j72ep2bW42o)

>>[Charts](#scrollTo=NwNR0nBdpETJ)

>>>[Line charts](#scrollTo=Lt1mG15LpHuB)

>>>[Bar charts](#scrollTo=v0MiZxompKgD)

>>>[Tables](#scrollTo=ce80c2CupM27)

>>[Reports](#scrollTo=q1U2YxPXpZKW)

>>>[Styling](#scrollTo=35yjVF5apbIt)

>>>[Saving PDFs](#scrollTo=Nww3XP2CpcuS)

>>>[Adding automation](#scrollTo=YRRZwgnipe6f)



## Getting Started with Data Science

Learn one of the many techniques for harvesting raw data and summarizing it into knowledge to share with others

* What is data science?
* Selecting data
* Obtaining data
* Installing libraries

### What is data science?
Data science involves many techniques for harvesting raw data and summarizing it into knowledge to share with others.

### Selecting data

### Obtaining data

### Installing libraries

## Describing Data
We are going to be writing functions to get our raw data into better shape. This will help prepare the data to make creating reports easier.

* Loading raw data
* Calculating descriptive statistics
* Calculating sums and totals
* Calculating averages
* Max and min

### Loading raw data

We're going to be writing functions to get our raw data into better shape. This will help us prepare the data to make our later work of creating reports easier.


*   Once we have our data in a data structure, in our casea a list of lists, we can use the concept of creating copies for extracting smaller samples to work with.
*   We can keep on writing functions for all the repetitive actions





In [0]:
# s2v1.py
# Loading Raw Data

import csv
import numpy

def open_with_csv(filename, d='\t'):
    data = []
    with open(filename, encoding='utf-8') as tsvin:
        tie_reader = csv.reader(tsvin, delimiter=d)
        for line in tie_reader:
            data.append(line)
        return data

data_from_csv = open_with_csv('data.csv')
print(data_from_csv[0])

FIELDNAMES = ['', 'id', 'priceLabel', 'name', 'brandId', 'brandName', 'imageLink', 'desc', 'vendor', 'patterned', 'material'] 

DATATYPES = [('myint', 'i'), ('myid', 'i'), ('price', 'f8'), ('name', 'a200'), ('brandID', '<i8'), ('brandName', 'a200'),
            ('imageURL', '|S500'), ('description', '|S900'), ('vendor', '|S100'), ('pattern', 'S50'), ('material', '|S50'),]

def load_data(filename, d='\t'):
    my_csv = numpy.genfromtxt(filename, delimiter=d, skip_header=1, invalid_raise=False, names=FIELDNAMES, dtype=DATATYPES)
    return my_csv

my_csv = load_data('data.csv')

['', 'id', 'priceLabel', 'name', 'brandId', 'brandName', 'imageLink', 'desc', 'vendor', 'print', 'material']


### Calculating descriptive statistics
Using built-in functions to desribe the data

In [0]:
# length of a list
apples = [1,2,3]
len(apples)

3

In [0]:
def number_of_records(data_sample):
  return len(data_from_csv)
  
number_of_ties = number_of_records(data_from_csv) - 1 # first line is header

print(number_of_ties, "ties in our data sample")

5050 ties in our data sample


* **sort(x)**: sorts the items of the list in place. It only modifies the list and has no return value printed
* **reverse(x)**: reverses the elements of the list in place
* **list.count(x)**: the number of times "x" appears in the list
* **list.append(x)**: appds an item to the end of the list

In [0]:
my_apple_bin = ["green", "red", "red", "green", "red", "green"]
your_apple_bin = ["green", "green", "red", "green", "green", "red"]

my_green_apples = my_apple_bin.count("green")
print(my_green_apples)

your_green_apples = your_apple_bin.count('red')
print(your_green_apples)

all_green_apples = my_green_apples + your_green_apples
print(all_green_apples)

3
2
5


In [0]:
def number_of_records2(data_sample):
  return data_sample.size

number_of_ties_my_csv = number_of_records2(my_csv)
print(number_of_ties_my_csv, "ties in our data sample")

5050 ties in our data sample


### Calculating sums and totals
Using NumPy to calculate sums

In [0]:
# s2v3.py
# from s2v2 import *

def calculate_sum(data_sample):
  total = 0
  for row in data_sample[1:]:
    price = float(row[2])
    total += price
  return total

print(calculate_sum(data_from_csv))

702600.6900000003


In [0]:
# list comprehension to make more succinct

def calculate_sum_succinct(data_sample):
  prices = [float(row[2]) for row in data_sample[1:]]
  return sum(prices)

print(calculate_sum_succinct(data_from_csv))

702600.6900000003


In [0]:
# lambda function list comprehension

def calculate_sum_concise(data_sample):
  prices = list(map(lambda x: float(x[2]), data_sample[1:]))
  return sum(prices)

print(calculate_sum_concise(data_from_csv))

702600.6900000003


In [0]:
# numpy sum function

def calc_numpy_sum(price):
  prices_in_float = [float(line) for line in price]
  total = numpy.sum(prices_in_float)
  return total

price = my_csv['priceLabel']
my_sum = calc_numpy_sum(price)
my_sum

702600.69

### Calculating averages
The 'mean', or the average, depending on how you want to call it can easily be calculated with a simple formula. It is thet otal divided by the number of records

In [0]:
def find_average(data_sample, header=False):
  if header:
    data_sample = data_sample[1:]
  total = calculate_sum(data_sample)
  size = number_of_records(data_sample)
  average = total / size
  return average

average_price = find_average(data_from_csv, True)
print("Average =",average_price)

# format the number more cleanly using string formatting - to two decimal places
print('Average = {:03.2f}'.format(average_price))

Average = 139.0775806770937
Average = 139.08


In [0]:
# check values of your variables using the built in type() function

print(average_price, int(average_price)) # rounds int down
print(type(int(average_price)))
print(type(average_price))
print(type(data_from_csv))
print(type(my_csv))

139.0775806770937 139
<class 'int'>
<class 'float'>
<class 'list'>
<class 'numpy.ndarray'>


In [0]:
midpoint = round(number_of_ties / 2)
message = "Average of {} half = ${:03.2f}"
print(message.format("", find_average(data_from_csv[:midpoint], True)))
print(message.format("", find_average(data_from_csv[midpoint:], False)))

Average of  half = $79.26
Average of  half = $59.78


### Max and min
Maximum and minimum values are different from the previous types of metrics. Instead of providing a figure that takes the whole range, it provides a value that represents the end of a range

In [0]:
# s2v5.py

def find_max(data_sample, col):
  temp_list = []
  for row in data_sample:
    price = float(row[col])
    temp_list.append(price)
  return max(temp_list)

# most expensive tie in our data sample
print(find_max(data_from_csv[1:], 2))

711.0


In [0]:
def find_min(data_sample, col):
  temp_list = []
  for row in data_sample:
    price = float(row[col])
    temp_list.append(price)
  return min(temp_list)

# least expensive tie in our data sample
print(find_min(data_from_csv[1:], 2))

10.0


In [0]:
# combine the two functions above into one

def find_max_min(data_sample, col, m='max'):
    temp_list = []
    val = 0
    for row in data_sample:
        price = float(row[col])
        temp_list.append(price)
    if m == "max":
        val = max(temp_list)
    elif m == "min":
        val = min(temp_list)
    else: # hopefully we don't come to this
        pass
    return val   

# cheapest tie in our data sample
print(find_max_min(data_from_csv[1:], 2, 'min'))

# most expensive tie in our data sample
print(find_max_min(data_from_csv[1:], 2, 'max'))

10.0
711.0


In [0]:
price = my_csv['priceLabel']
price_in_float = [float(x) for x in price]

numpy_max = numpy.amax(price_in_float)
print(numpy_max)

711.0


## Cleaning Data
Filter through data to find specific characteristics
* Cleaning up your data
* Filtering rows
* Grouping rows

### Cleaning up your data
No matter your data source, you should always expect to clean your datasets

In [0]:
# s3v1
def create_bool_field_from_search_term(data_sample, search_term):
    new_array = []
    new_array.append(data_sample[0].append(search_term))
    
    for row in data_sample[1:]:
        new_bool_field = False
        if search_term in row[7]:
            new_bool_field = True
            
            row.append(new_bool_field)
            new_array.append(row)
            
    return new_array
  

my_new_csv = create_bool_field_from_search_term(data_from_csv, "cashmere")
#print("Length:", number_of_records(my_new_csv))

def filter_column_by_bool(data_sample, col):
  matches_search_term = []
  
  for item in data_sample[1:]:
    if item[col]:
      matches_search_term.append(item)
  return matches_search_term

my_new_csv = create_bool_field_from_search_term(data_from_csv, "cashmere")
number_of_cashmere_ties =  number_of_records(filter_column_by_bool(my_new_csv, 11))
print("Length:", number_of_cashmere_ties)

Length: 5051


In [0]:
# Clean Up Your Data

def create_bool_field_from_search_term(data_sample, search_term):
    new_array = []
    new_array.append(data_sample[0].append(search_term))
    
    for row in data_sample[1:]:
        new_bool_field = False
        if search_term in row[7]:
            new_bool_field = True
            
            row.append(new_bool_field)
            new_array.append(row)
            
    return new_array

def filter_column_by_bool(data_sample, col):
    matches_search_term = []
    
    for item in data_sample[1:]:
        if item[col]:
            matches_search_term.append(item)
            
    return matches_search_term    
        

my_new_csv = create_bool_field_from_search_term(data_from_csv, "cashmere")
number_of_cashmere_ties = number_of_records(filter_column_by_bool(my_new_csv, 11))
print("Length:", number_of_cashmere_ties)

Length: 5051


In [0]:
# FROM SOLUTIONS

def create_bool_field_from_search_term(data_sample, search_term):
  new_array = []
  new_array.append(data_sample[0].append(search_term))
  
  for row in data_sample[1:]:
    new_bool_field = False
    if search_term in row[7]: 
      new_bool_field = True
      
    row.append(new_bool_field)
    new_array.append(row)
  
  return new_array

def filter_col_by_bool(data_sample, col):
  matches_search_term = []
  
  for item in data_sample[1:]:
    if item[col]:
      matches_search_term.append(item)
  
  return matches_search_term
  
my_new_csv = create_bool_field_from_search_term(data_from_csv, "cashmere")
number_of_cashmere_ties = number_of_records(filter_col_by_bool(my_new_csv, 11))

print("Length:", number_of_cashmere_ties)

Length: 5051


### Filering rows
Filtering is helpful when you want to look at a subset of data - it is easier to work with smaller samples

In [0]:
# Filter Rows

def filter_col_by_string(data_sample, field, filter_condition):
    filtered_rows = []
    
    col = int(data_sample[0].index(field))
    filtered_rows.append(data_sample[0])
    
    for item in data_sample[1:]:
        if item[col] == filter_condition:
            filtered_rows.append(item)
    
    return filtered_rows

silk_ties = filter_col_by_string(data_from_csv, "material", "_silk")
wool_ties = filter_col_by_string(data_from_csv, "material", "_wool")
cotton_ties = filter_col_by_string(data_from_csv, "material", "_cotton")
gucci_ties = filter_col_by_string(data_from_csv, "brandName", "Gucci")

print("Found {} Gucci ties.".format(number_of_records(gucci_ties)))
print("Found {} silk ties.".format(number_of_records(silk_ties)))
print("Found {} wool ties.".format(number_of_records(wool_ties)))
print("Found {} cottom ties.".format(number_of_records(cotton_ties)))

def filter_col_by_float(data_sample, field, direction, filter_condition):
    filtered_rows = []
    
    col = int(data_sample[0].index(field))
    cond = float(filter_condition)
    
    for row in data_sample[1:]:
        element = float(row[col])
        
        if direction == "<":
            if element < cond:
                filtered_rows.append(row)
        elif direction == "<=":
            if element <= cond:
                filtered_rows.append(row)
        elif direction == ">":
            if element > cond:
                filtered_rows.append(row)
        elif direction == ">=":
            if element >= cond:
                filtered_rows.append(row)
        elif direction == "==":
            if element == cond:
                filtered_rows.append(row)
        else:
            pass
    return filtered_rows

under_20_bucks = filter_col_by_float(data_from_csv, "priceLabel", "<=", 20)
print("Found {} ties < $20".format(number_of_records(under_20_bucks)))

Found 5051 Gucci ties.
Found 5051 silk ties.
Found 5051 wool ties.
Found 5051 cottom ties.
Found 5051 ties < $20


### Grouping rows
Learning how to combine objects that match a condition

Key terms:
* **filter** is a type of selection that refers to restricting the result set to contain only those elements that satisfy a specified condition
* **grouping** refers to the operation of putting data into groups so that the elements in each group share a common attribute

In [0]:
gucci_ties = filter_col_by_string(data_from_csv, "brandName", "Gucci")
jcrew_ties = filter_col_by_string(data_from_csv, "brandName", "J.Crew")

# compare maximum prices
max_gucci = find_max(gucci_ties[1:], 2)
max_jcrew = find_max(jcrew_ties[1:], 2)

# compare average prices
avg_gucci = find_average(gucci_ties, True)
avg_jcrew = find_average(jcrew_ties, True)

message = "{} {} tie price is ${:03.2f}"

print(message.format("Max","Gucci", max_gucci))
print(message.format("Max","J.Crew", max_jcrew))

print(message.format("Average", "Gucci", avg_gucci))
print(message.format("Average", "J.Crew", avg_jcrew))

# ties by pattern type
striped_ties = filter_col_by_string(data_from_csv, "print", "_striped")
print_ties = filter_col_by_string(data_from_csv, "print", "_print")
paisley_ties = filter_col_by_string(data_from_csv, "print", "_paisley")
solid_ties = filter_col_by_string(data_from_csv, "print", "_solid")

message2 = "{}\t${:03.2f}"
print("\nPrint\tAverage")
print(message2.format("striped", find_average(striped_ties)))
print(message2.format("print", find_average(print_ties)))
print(message2.format("paisley", find_average(paisley_ties)))
print(message2.format("solid", find_average(solid_ties)))

Max Gucci tie price is $545.00
Max J.Crew tie price is $79.50
Average Gucci tie price is $5.58
Average J.Crew tie price is $1.22

Print	Average
striped	$20.08
print	$19.79
paisley	$7.87
solid	$2.23


In [0]:
gucci_ties = filter_col_by_string(data_from_csv, "brandName", "Gucci")
jcrew_ties = filter_col_by_string(data_from_csv, "brandName", "J.Crew")

max_gucci = find_max(gucci_ties[1:], 2)
max_jcrew = find_max_min(jcrew_ties[1:], 2)

message = "{} {} tie price is = ${:03.2f}"
print(message.format("Maximum", "Gucci", max_gucci))
print(message.format("Maximum", "J.Crew", max_jcrew))

avg_gucci = find_average(gucci_ties, True)
avg_jcrew = find_average(jcrew_ties, True)
print(message.format("Average", "Gucci", avg_gucci))
print(message.format("Average", "J.Crew", avg_jcrew))

striped_ties = filter_col_by_string(data_from_csv, "print", "_striped")
print_ties = filter_col_by_string(data_from_csv, "print", "_print")
paisley_ties = filter_col_by_string(data_from_csv, "print", "_paisley")
solid_ties = filter_col_by_string(data_from_csv, "print", "_solid")

message2 = "{}\t${:03.2f}"
print("Print\tAverage")
print(message2.format("striped", find_average(striped_ties)))
print(message2.format("print", find_average(print_ties)))
print(message2.format("paisley", find_average(paisley_ties)))
print(message2.format("solid", find_average(solid_ties)))

Maximum Gucci tie price is = $545.00
Maximum J.Crew tie price is = $79.50
Average Gucci tie price is = $5.58
Average J.Crew tie price is = $1.22
Print	Average
striped	$20.08
print	$19.79
paisley	$7.87
solid	$2.23


## Exporting
In this stage we'll be exporting data into files, in case you want to save them for future reference, or open them in a different software such as your favorite spreadsheet application.

* Exporting CSV files
* Writing functions to export csv files
* Exporting to Excel


### Exporting CSV files
The CSV format is a very common data format to export or import data. Most spreadsheets and database applications offer export and import for CSV files

Key concepts:
* NumPy comes with functions for writing files, numpy.savetxt()
* syntax:
```
csv.writer(csvfilename, **fmtparams)
```

In [0]:

def write_to_file(filename, data_sample):
  example = csv.write(open(filename, 'w', encoding='utf-8'), dialect='excel')
  example.writerows(data_sample)
  
write_to_file("_data/s4-silk_ties.csv", silk_ties)

### Writing functions to export csv files
Learn som eof the specific functions you will use to export your CSV files

In [0]:
def write_brand_and_price_file(filename, data_sample):
  brand_field_index = 5
  price_field_index = 2
  
  new_array = []
  
  for record in data_sample:
    new_record = [None] * 2
    new_record[0] = record[brand_field_index]
    new_record[1] = record[price_field_index]
    new_array.append(new_record)
  
  write_to_file(filename, new_array)

write_brand_and_file_price('_data/s4-brand_and_price.csv', gucci_ties)

In [0]:
def write_min_max_csv(filename, data_sample):
  min = find_max_min(data_sample, 2, "min")
  max = find_max_min(data_sample, 2, "max")
  
  new_array = []
  for record in data_sample:
    if (float(record[2]) == min) or (float(record[2]) == max):
      new_array.append(record)

  write_to_file(filename, new_array)
  
write_min_max_csv('_data/s4-min_max_csv', gucci_ties[1:])

In [0]:
def write_two_cols(filename, data_sample, col1, col2):
  new_array = []
  for record in data_sample:
    new_record = [None] * 2
    new_record[0] = record[col1]
    new_record[1] = record[col2]
    new_array.append(new_record)
  write_to_file(filename, array)

write_to_cols('_data/s4-write_two_cols.csv', gucci_ties[1:],3,7)

### Exporting to Excel
Excel is still one of the most common formats when sharing documents

In [0]:
!pip install openpyxl

Collecting openpyxl
[?25l  Downloading https://files.pythonhosted.org/packages/04/18/64737cc6c5233e15374d21b4958a5600be52359e71063b4d4e7a604a1387/openpyxl-2.5.9.tar.gz (1.9MB)
[K    100% |████████████████████████████████| 1.9MB 6.4MB/s 
[?25hCollecting jdcal (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/a0/38/dcf83532480f25284f3ef13f8ed63e03c58a65c9d3ba2a6a894ed9497207/jdcal-1.4-py2.py3-none-any.whl
Collecting et_xmlfile (from openpyxl)
  Downloading https://files.pythonhosted.org/packages/22/28/a99c42aea746e18382ad9fb36f64c1c1f04216f41797f2f0fa567da11388/et_xmlfile-1.0.1.tar.gz
Building wheels for collected packages: openpyxl, et-xmlfile
  Running setup.py bdist_wheel for openpyxl ... [?25l- \ | done
[?25h  Stored in directory: /root/.cache/pip/wheels/57/41/b9/3765af8bda4a8d4b6aaf4957d7214984c3332348713e85cf36
  Running setup.py bdist_wheel for et-xmlfile ... [?25l- done
[?25h  Stored in directory: /root/.cache/pip/wheels/2a/77/35/0da0965a0576981

In [0]:
#s4v3.py
# Exporting to Excel
import openpyxl
from openpyxl import Workbook
from openpyxl.writer.excel import ExcelWriter
from openpyxl.cell import get_column_letter
from s4v2 import *

def save_spreadsheet(filename, data_sample):
    wb = Workbook()
    ws = wb.active
    row_index = 1
    for rows in data_sample:
        col_index = 1
        
        for field in rows:
            col_letter = get_column_letter(col_index)
            ws.cell('{}{}'.format(col_letter, row_index)).value = field
            col_index += 1
        row_index += 1
    
    wb.save(filename)

kiton_ties = filter_col_by_string(data_from_csv, "brandName", "Kiton")
save_spreadsheet("_data/s4-kiton.xlsx", kiton_ties)

### Challenge: Exporting

Using the `filter_col_by_string` function, export all "Dolce & Gabbana" brand ties to a variable named `dolce_gabbana`

In [0]:
from openWithCsv import *

data_from_csv = open_with_csv('data.csv')

def filter_col_by_string(the_data, field, filter_condition):
    filtered_rows = []
    
    #find index of field in first row
    col = int(the_data[0].index(field))
    filtered_rows.append(the_data[0])

    for row in the_data[1:]:
        if row[col] == filter_condition:
            filtered_rows.append([str(x).encode('utf8') for x in row])
            
    return filtered_rows

# YOUR CODE HERE
dolce_gabbana = filter_col_by_string(data_from_csv, "brandName", "Dolce & Gabbana")

## Charts

Now that you have your usable data, how to view it depends on your audience and their needs

### Line charts
Line charts are known as "plots" in matplotlib. In this video, we are going to use the ggplot stylesheet instead of the default styles.

Basic matplotlib plotting concepts:
Use the module `plot()` to indicate the x and y variables, colors, labels, and other properties such as line width.
* **figure**: a container for all the Axes
* **axes**: the space where what you draw (i.e., your plot) actually shows up
* **axis**: the actual x axis or y axis

In [0]:
# Line charts
from s4v3 import *
import matplotlib.pyplot as plt
import numpy as np

def create_line_chart(data_sample, title, exported_figure_filename):
  fig = plt.figure()
  ax = fig.add_subplot(1,1,1)
  
  prices = (sorted(map(float, data_sample)))
  
  x_axis_ticks = list(range(len(data_sample)))
  ax.plot(x_axis_ticks, prices, linewidth=2)
  ax.set_title(title)
  ax.set_xlim([0, len(data_sample)])
  ax.set_xlabel('Tie Price ($)')
  ax.set_ylabel('Number of ties')
  
  fig.savefig(exported_figure_filename)

create_line_chart([x[2] for x in gucci_ties[1:]], "Distribution of prices for Gucci Ties", "_charts/gucci.png")

### Bar charts
Create a bar chart using matplotlib and numpy

In [0]:
def plot_all_bars(prices_in_float, exported_figure_filename):
  prices = list(map(int, prices_in_float))
  X = numpy.arange(len(prices))
  width = 0.25
  plt.bar(X+width, prices, width)
  plt.xlim([0,5055])
  plt.savefig(exported_figure_filename)

def create_bar_chart(price_groups, exported_figure_filename):
  fig = plt.figure()
  ax = fig.add_subplot(1,1,1)
  plt.style.use('ggplot')
  colors=plt.rcParams['axes.color_cycle']
  
  for group in price_groups:
    ax.bar(group, price_groups[group], colors=colors[group%len(price_groups)])
    
    labels = ["$0-$50","$50-$100","$100-$150","$150-$200","$200-$250","$250+"]
    ax.legend(labels)
    
    ax.set_title("Amount of ties at price points")
    ax.set_xlabel("Tie Price ($)")
    ax.set_xticklabels(labels, ha='left')
    ax.set_xticks(range(1, len(price_groups)+1))
    ax.set_ylabel("Number of ties")

from collections import Counter
def group_prices_by_range(prices_in_float):
  tally = Counter()
  for item in prices_in_float:
    bucket = 0
    rounded_price = round(item, -1)
    if rounded_price >= 0 and rounded_price <= 50:
      bucket = 1
    elif rounded_price >= 50 and rounded_price <= 100:
      bucket = 2
    elif rounded_price >= 100 and rounded_price <= 150:
      bucket = 3
    elif rounded_price >= 150 and rounded_price <= 200:
      bucket = 4
    elif rounded_price >= 200 and rounded_price <= 250:
      bucket = 5
    elif rounded_price >= 250:
      bucket = 6
    else:
      bucket = 7
    
    tally[bucket] += 1
  return tally

price_groups = group_prices_by_range(price_in_float)
create_bar_chart(price_groups, "_charts/s5-price_in_groups.png")

### Tables
At times, showing data in a chart doesn't provide enough context or details, so we want to use tables to show a more detailed view of specific characteristics of each data point.

In [0]:
#s5v3.py
# Tables

import matplotlib.pyplot as plt
from prettytable import PrettyTable

def my_table():
  x = PrettyTable(['Style','Average Price'])
  x.add_row(['Print', pretty_average(print_ties)])
  x.add_row(['Solid', pretty_average(solid_ties)])
  x.add_row(['Paisley', pretty_average(paisley_ties)])
  x.add_row(['Striped', pretty_average(striped_ties)])
  x.add_row(['Gucci', pretty_average(gucci_ties)])
  
  print(x)
  

def pretty_average(my_number):
  pretty_avg = "${:03.2f}".format(find_average(my_number))
  return pretty_avg
  
#my_table()

# working with pdfs

def count_prices_for_brands(data_sample, brand, min_price, max_price):
  count = 0
  for row in data_sample:
    if str(row[0]) == str(brand):
      if min_price < float(row[1]) < max_price:
        count += 1
  return count

def create_table(data_sample, price_groups, brand_names, columns, exported_figure_filename):
    tup = build_table_text(data_sample, brand_names)
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    for group in price_groups:
        plt.bar(group, price_groups[group]) 
    
    if tup[0] and tup[1]:
        ax.table(cellText=tup[0], colLabels=columns, rowLabels=tup[1], loc='bottom')
        ax.text(-1.3, 0, 'Discounted Ties Brands', size=12, horizontalalignment='left', verticalalignment='top')
        ax.tick_params(
            axis='x',          # changes apply to the x-axis
            which='both',      # both major and minor ticks are affected
            labelbottom='off') # labels along the bottom edge are off

    fig.savefig(exported_figure_filename, dpi=400, bbox_inches='tight')

from collections import Counter
def group_prices_by_range(prices_in_float):
  tally = Counter()

  for item in prices_in_float:
    bucket = 0
    rounded_price = round(item, -1)
    if 0 <= rounded_price <= 50:
      bucket = 1
    elif 50 <= rounded_price <= 100:
      bucket = 2
    elif 100 <= rounded_price <= 150:
      bucket = 3
    elif 150 <= rounded_price <= 200:
      bucket = 4
    elif 200 <= rounded_price <= 250:
      bucket = 5
    elif 250 <= rounded_price:
      bucket = 6
    else:
      bucket = 7

    tally[bucket] += 1
  return tally
  
brands = my_csv['brandName']
columns = ["$0-50", "$50-100", "$100-150", "$150-200", "$200-250", "$250+"]
write_brand_and_price_file("_data/tempTableFile.csv", data_from_csv)
brand_and_price_data = open_with_csv("_data/tempTableFile.csv", d=',')
create_table(brand_and_price_data, price_groups, brands, columns, "_charts/s5_prices_in_table.png")

## Reports

### Styling
Edit title and axes labels

In [0]:
from s5v3 import *

def create_line_chart2(data_sample, title, exported_figure_filename):
  fig = plt.figure()
  ax = fig.add_subplot(1,1,1)
  
  prices = (sorted(map(float, data_sample)))
  
  x_axis_ticks = list(range(len(data_sample)))
  ax.plot(x_axis_ticks, prices, linewidth=2)
  ax.set_title(title)
  ax.set_xlim([0, len(data_sample)])
  ax.set_xlabel("[THIS IS THE X]")
  ax.set_ylabel("[THIS IS THE Y]")
  
  fig.savefig(exported_figure_filename)

create_line_chart2([float(x[2]) for x in jcrew_ties], "THIS IS THE TITLE", "labels.png")

### Saving PDFs

Using matplotlib you will directly move your plots into PDF pages.

The key steps to saving PDFs are:
1.   Import PdfPages module
2.   Create new object of PdfPages with a filename
3.   Save figure(s)
4.   Close the object

Corresponding code:

```
from matplotlib.backends.backend_pdf import PdfPages

pp = PdfPages('foo.pdf')

pp.savefig()

pp.close()
```




In [0]:
# CREATE PDF

import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages
#from s6v1 import *

def plot_minimal_graph(tally, columns, *args):
  plt.style.use('bmh')
  fig = plt.figure(dpi=200)
  colors = plt.rcParams['axes.color_cycle']
  
  # white background to use less color ink
  ax = plt.subplot(111, axisbg='white')
  
  # plot bars and screate text labels for the table
  for priceBucket in tally:
    ax.bar(priceBucket, tally[priceBucket], color=colors[priceBucket%len(tally)])
    ax.annotate(r"%d" % (tally[priceBucket]),
               (priceBucket+0.2, tally[priceBucket]),
               va="bottom", ha="center")
    
    # include a legend
    ax.legend(columns)
    
    # remove distracting lines on top, left, and right
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    
    # remove distracting tick marks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    # add chart title and axes labels
    plt.xlabel("Tie Price", fontsize = 13)
    plt.ylabel("Number of Ties", fontsize = 13)
    plt.title("Chart # 1")
    
    # add labels to bars along x axes
    x = range(1, len(tally)+1)
    plt.xticks(x, columns, rotation='horizontal',ha='left')
    
    return fig
  
def plot_graph_with_table(cell_text, row_text, columns):
  plt.style.use('ggplot')
  fig = plt.figure()
  
  
  # Include table
  ax2 = fig.add_subplot(111)
  ax2.axis("off")
  
  the_table = ax2.table(cellText=cell_text, rowLabels=row_text, colLabels=columns, loc='center right')
  
pp = PdfPages('my_report.pdf')

plot1 = plot_minimal_graph(price_groups, columns)
pp.savefig(plot1, bbox_inches='tight')

table_text = build_table_text(brand_and_price_data, brands)
plot2 = plot_graph_with_table(table_text[0], table_text[1], columns)
plt.save(plot2, bbox_inches='tight')

pp.close()

### Adding automation

By layering the functions we've written, we can write coordinating functions