# <center>Data Wrangling Excel Files - Stacey Sandy</center>

## Part 1

As suggested in Use Case 1 of the <i>From The Expert</i> (FTE), we need to read the other nine stock files and insert their data into the database. While nine files aren't an undue burden to read manually, we are going to look ahead to the time when we may have 100 log files to read, and implement this code using the DRY Principle. DRY stands for

* Don't 
* Repeat
* Yourself

So, while we could create an individual cell to read each stock file, for this assignment, do it in one. I'll give you some help to get started.


<hr>

_Reference:_ <br>
https://en.wikipedia.org/wiki/Don%27t_repeat_yourself

Much of computer science and programming is about identifying and exploiting patterns. In this case, we know there is a pattern to how the files are named, and we know (through inspection) that there is a pattern to the columns of data inside. 

The file names all start with a year, from 2009 straight to 2019 with no breaks:

```
2009_aapl_data.xlsx
2010_aapl_data.xlsx
2011_aapl_data.xlsx
2012_aapl_data.xlsx
2013_aapl_data.xlsx
2014_aapl_data.xlsx
2015_aapl_data.xlsx
2016_aapl_data.xlsx
2017_aapl_data.xlsx
2018_aapl_data.xlsx
2019_aapl_data.xlsx
```

Do we know of anything in Python capable of generating a **range** of numbers like that?

In [1]:
#Hint from the instructor in the assignment 2 Jupyter Notebook.
for x in range(2009, 2020):
    print(x)

2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019


That should be enough to get you going. Feel free to use code and helper functions from the FTE. I gave you the .ipynb file for a reason :).

Remember, 

* If you ran the FTE code, it will have already read in 2009 and created the database file. 
* If you start at 2009, depending on the condition above, you may need to handle the table already existing.

It is your choice whether or not to use the dataset library. As with many things in life, there are tradeoffs -- some things are easier, some not. 
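
If you skip the dataset library, the table-already-exists case can be handled in plain SQL. The following is a minimal sketch using the standard-library sqlite3 module against an in-memory database. The helper name `reset_table` and the column types are assumptions for illustration, not part of the FTE code.

```python
import sqlite3

# An in-memory database keeps the sketch from touching stock_prices.db.
conn = sqlite3.connect(":memory:")


def reset_table(connection):
    """Drop the aapl table if it exists, then create it empty."""
    connection.execute("DROP TABLE IF EXISTS aapl")
    connection.execute(
        """
        CREATE TABLE aapl (
            id INTEGER PRIMARY KEY,
            date TEXT,
            close REAL,
            volume REAL,
            open REAL,
            high REAL,
            low REAL
        )
        """
    )
    connection.commit()


reset_table(conn)
conn.execute("INSERT INTO aapl (date, close) VALUES ('2009-09-04', 24.33)")
reset_table(conn)  # safe to call again: the old table is dropped first

row_count = conn.execute("SELECT COUNT(*) FROM aapl").fetchone()[0]
print(row_count)
```

Dropping and recreating is the blunt option. If the 2009 rows from the FTE run should be kept, starting the loop at 2010 avoids the problem instead.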

**Deliverable:**

When you are done, you will have a database table with slightly more than 2500 rows in it. **Show this by doing a query that counts rows in the table.**

In [2]:
#Check the type of x.
type(x)

int

Recall that a string and an integer cannot be joined with +, because concatenation only works between strings. I learned this the hard way, so the year integers must be converted into strings before the file names are built.
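
To make the point concrete, here is a minimal sketch of what goes wrong and two ways to fix it. The suffix comes from the file list above; the variable names are only illustrative.

```python
suffix = "_aapl_data.xlsx"
year_value = 2009  # an int, like the values produced by range()

# Concatenating an int and a str directly raises a TypeError.
try:
    bad_name = year_value + suffix
except TypeError as err:
    print(f"TypeError: {err}")

# Option 1: convert the int explicitly with str().
name_with_str = str(year_value) + suffix

# Option 2: let an f-string do the conversion.
name_with_fstring = f"{year_value}{suffix}"

print(name_with_str)
print(name_with_fstring)
```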

In [3]:
#Print x to see the value it still holds after the loop.
print(x)
#Notice that only 2019 prints: once the for loop has finished, x keeps the last value from the range.

2019


In [4]:
#Assign the range 2009 through 2019 to x (the stop value 2020 is excluded).
x = range(2009, 2020)

In [5]:
#Here is a simple print of the x range values in a list
print(list(x))

[2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]


In [6]:
#Let's save the list to the variable year.
year = (list(x))

In [7]:
#Confirm year type.
type(year)

list

In [8]:
#Let's print the values in year variable.
print(year)

[2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]


<b>My Assignment deliverables begin here:</b>

First let's import all the necessary modules and libraries for this assignment.

In [2]:
#Import the libraries required for working with Excel spreadsheet files, loading them into a database, and creating a workbook.
import dataset
from openpyxl import load_workbook
import sqlite3

Remember that all the file names follow the same pattern: YYYY_aapl_data.xlsx, where YYYY is a year value.<br>
The only thing that changes is the year. :) <br>
<p>We will need to create a variable to hold the constant part (_aapl_data.xlsx).</p><br>
And also put the database file name (stock_prices.db) into a variable, too.

In [10]:
#Here input_name is the Excel spreadsheet file name minus the year (YYYY), and db_file is the database file name (stock_prices.db).
input_name = '_aapl_data.xlsx'
db_file = 'stock_prices.db'
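
With the constant part stored in a variable, the full list of file names can be produced in one expression. This is only a sketch; `file_names` is an illustrative name that does not appear elsewhere in the notebook.

```python
input_name = "_aapl_data.xlsx"  # constant part of every file name

# One file name per year; range() stops before 2020, so 2019 is the last year.
file_names = [f"{file_year}{input_name}" for file_year in range(2009, 2020)]

print(len(file_names))
print(file_names[0], file_names[-1])
```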

In [11]:
#Use dataset to create the database (.db) file connection.
db = dataset.connect("sqlite:///" + db_file)

I expect the database to contain no tables because I have never done this before. <br>
For learning purposes, I am going to confirm that no tables come back from the database.<br>
The following code checks the list of table names: if the list comes back empty, the database is empty.<br> 
If the list has something in it, the for loop drops each table.

In [12]:
#If the database contains any tables, use a for loop to drop each one.
if (len(db.tables) > 0):
    for table in db.tables:
        db[table].drop()

In [13]:
#Let's check if any table exists within the database.
db.tables

[]

In [14]:
# First, create an empty table. We'll create the columns after that from the spreadsheet's first row.
# The create_table function returns the table so we automatically have a handle
table = db.create_table("aapl")

From here on in the assignment, we will open our 2009-2019 data, get the header, and create the table columns!

Dataset inserts a dictionary as a row, so we will need to loop through each row of the worksheet and construct a dictionary from the cells. The dictionary keys are the column names (which are conveniently stored in the table.columns list) and the values are the cell values.<br>
Each worksheet has a values property we can use as our loop's iterable to return only the values. This property is just a big block of rows, so if we want to slice off the header we have to convert it to something list-like: tuple(sheet.values)[1:].
We can build our dictionary from this for loop. <br>
<i>Remember we don't have to insert the id column, so we slice it off the list of keys.</i>
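
Before running this against the real files, the row-to-dictionary step can be tried in isolation. The sketch below is a stand-in built on assumptions: `sheet_values` mimics tuple(sheet.values) with a header and the one data row whose raw cell values appear in this notebook's output, and the keys come from the header row rather than from table.columns.

```python
from datetime import datetime

# Stand-in for tuple(sheet.values): a header row followed by data rows.
sheet_values = (
    ("date", "close", "volume", "open", "high", "low"),
    ("2019/09/04", "209.1900", "19216820.0000", "208.3900", "209.4800", "207.3200"),
)

keys = list(sheet_values[0])  # column names from the header row

dict_rows = []
for values in sheet_values[1:]:  # slice off the header
    # Parse the date string, then keep the remaining cells unchanged.
    parsed_date = datetime.strptime(values[0], "%Y/%m/%d").date()
    row = [parsed_date] + list(values[1:])
    dict_rows.append(dict(zip(keys, row)))

print(dict_rows[0])
```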

In [15]:
#Build each file name from the year and the constant input_name,
#then use nested for loops (files, then rows) to load the data.
#Note: keys comes from table.columns, so the table's columns must already exist
#(see the create_column cell below) for the dictionaries to be populated.
from datetime import datetime

keys = table.columns[1:]  #Skip the id column.

for y in year:
    data_file = "aapl_data/" + str(y) + input_name
    workbook = load_workbook(filename=data_file)
    sheet = workbook.active
    print(data_file)

    #Build a dictionary for each data row (the header row is sliced off) and insert it.
    for values in list(sheet.values)[1:]:
        row = [datetime.strptime(values[0], '%Y/%m/%d').date()]
        row = row + list(values[1:])
        d_row = dict(zip(keys, row))
        table.insert(d_row)
#The output lists the Excel spreadsheet files that were loaded into the database.

aapl_data/2009_aapl_data.xlsx
aapl_data/2010_aapl_data.xlsx
aapl_data/2011_aapl_data.xlsx
aapl_data/2012_aapl_data.xlsx
aapl_data/2013_aapl_data.xlsx
aapl_data/2014_aapl_data.xlsx
aapl_data/2015_aapl_data.xlsx
aapl_data/2016_aapl_data.xlsx
aapl_data/2017_aapl_data.xlsx
aapl_data/2018_aapl_data.xlsx
aapl_data/2019_aapl_data.xlsx


In [16]:
#Let's check if our table exists within the database.
db.tables

['aapl']

In [17]:
#Let's take a look at the header
header = sheet[1]
header[0].value 

'date'

Because we need to give the columns a type when we create them, let's create a helper, isfloat(), that returns True or False depending on whether a value can be read as a float, and a function, get_type(), that picks the column type from a sample value. <b>Recall, all of the columns are floating point except for 'date' which is a datetime type.</b><br>
<p>NOTE: <br>
isdigit() will return True if the string is an integer, but will give False if there is a decimal point. <br>
The float() function will throw a ValueError if there are letters in the string.<br>
The dates always have '/' in them.<br>
Anything else is treated as a Unicode string.</p>

In [18]:
def isfloat(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

# Note: These have to be tested in the right order. isfloat() reports True for integers.
def get_type(value):
    if value.isdigit():
        return dataset.types.Integer
    elif isfloat(value):
        return dataset.types.Float
    elif '/' in value:
        return dataset.types.Date
    else:
        return dataset.types.Unicode
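
One thing to note is that get_type() calls value.isdigit(), so it assumes every cell value is a string. Below is a hedged sketch of a more tolerant variant. `describe_value` is a hypothetical helper that returns plain labels instead of dataset types so that it runs without the library, and converting every value with str() first is my assumption about how to treat non-string cells such as None.

```python
def isfloat(value):
    """Return True if float() can parse the value."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def describe_value(value):
    """Label a cell value as 'integer', 'float', 'date', or 'text'."""
    text = str(value)  # tolerate non-string cells such as None or numbers
    # The order matters: isfloat() is also True for integer strings.
    if text.isdigit():
        return "integer"
    elif isfloat(text):
        return "float"
    elif "/" in text:
        return "date"
    return "text"


for sample in ["2019/09/04", "209.1900", "19216820", None]:
    print(f"{sample!r} is {describe_value(sample)}")
```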

Now let's create the columns of the database!<br>
We will be using the create_column function available in the dataset library.<br>
The name will be the name of the column in the database, and the type is the Dataset type returned by our get_type() function.

In [20]:
#Let's test the function on row 2.
row2 = sheet[2]
for cell in row2:
    print(f'{cell.value} is {get_type(cell.value)}')

2019/09/04 is <class 'sqlalchemy.sql.sqltypes.Date'>
209.1900 is <class 'sqlalchemy.sql.sqltypes.Float'>
19216820.0000 is <class 'sqlalchemy.sql.sqltypes.Float'>
208.3900 is <class 'sqlalchemy.sql.sqltypes.Float'>
209.4800 is <class 'sqlalchemy.sql.sqltypes.Float'>
207.3200 is <class 'sqlalchemy.sql.sqltypes.Float'>


In [21]:
# Remember: header contains the first-row cells, which hold the column names.
# The enumerate function yields each cell along with its position (index),
# and index is used to pick the matching sample value from row2.
for index, col_name in enumerate(header):
    table.create_column(col_name.value, get_type(row2[index].value))

In [22]:
#Let's double check our work and output the column with headers.
table.columns

['id', 'date', 'close', 'volume', 'open', 'high', 'low']
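
The column-creation loop above uses enumerate to look up the matching cell in row2 by position. The same pairing can be sketched with zip. The two lists below are stand-ins for the header and first data row, filled with the values shown in the notebook output.

```python
header_names = ["date", "close", "volume", "open", "high", "low"]
first_row = ["2019/09/04", "209.1900", "19216820.0000",
             "208.3900", "209.4800", "207.3200"]

# enumerate yields (index, item) pairs, so the index can address a second list.
pairs_by_index = [(name, first_row[index]) for index, name in enumerate(header_names)]

# zip walks both lists in step and yields the same pairs without an index.
pairs_by_zip = list(zip(header_names, first_row))

print(pairs_by_index == pairs_by_zip)
```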


Finally, we count the number of rows in the database table. A total of slightly more than 2500 confirms that the data from all the worksheets was loaded into the table.

In [23]:
#Check the number of rows in database.
print(f'Rows in database: {len(table)}')

Rows in database: 2516
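
The same count can also be obtained with a SQL query through the standard-library sqlite3 module, which this notebook imports but does not otherwise use. The sketch below runs against an in-memory table holding three rows copied from the data so that it is self-contained. Pointing sqlite3.connect at stock_prices.db instead is assumed to give the real count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for stock_prices.db
conn.execute("CREATE TABLE aapl (id INTEGER PRIMARY KEY, date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO aapl (date, close) VALUES (?, ?)",
    [("2019-09-04", 209.19), ("2019-09-03", 205.7), ("2019-08-30", 208.74)],
)

# COUNT(*) returns a single row with a single column.
(row_count,) = conn.execute("SELECT COUNT(*) FROM aapl").fetchone()
print(f"Rows in database: {row_count}")
```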


## Part 2

Now that you have a working database with a reasonable amount of data in it, do some queries with it and show the data:

1. Find all days where the stock closed lower than 25. 
    * Print a count of how many
    * Print the first 5 rows found
2. Find all days in 2017 where the stock closed above 35.
    * Print a count of how many
    * Print the last 5 found.
    
**Deliverable:**

3. Create a new workbook and put each query result on a new worksheet in the workbook. Remember to save it to disk.


In [1]:
#Create a function to print all the rows of a worksheet, called print_rows().
#Then we will create a new worksheet and query our database.
def print_rows():
    for row in sheet.iter_rows(values_only=True):
        print(row)

In [3]:
#Load the following required libraries
from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

In [4]:
import dataset
db = dataset.connect("sqlite:///stock_prices.db")
print(db.tables)

['aapl']


In [5]:
# Show the first 10 rows as they come back from the database
for i, row in enumerate(db['aapl']):
    if i < 10:
        print(row)

OrderedDict([('id', 1), ('date', datetime.date(2019, 9, 4)), ('close', 209.19), ('volume', 19216820.0), ('open', 208.39), ('high', 209.48), ('low', 207.32)])
OrderedDict([('id', 2), ('date', datetime.date(2019, 9, 3)), ('close', 205.7), ('volume', 20059570.0), ('open', 206.43), ('high', 206.98), ('low', 204.22)])
OrderedDict([('id', 3), ('date', datetime.date(2019, 8, 30)), ('close', 208.74), ('volume', 21162560.0), ('open', 210.16), ('high', 210.45), ('low', 207.2)])
OrderedDict([('id', 4), ('date', datetime.date(2019, 8, 29)), ('close', 209.01), ('volume', 21007650.0), ('open', 208.5), ('high', 209.32), ('low', 206.655)])
OrderedDict([('id', 5), ('date', datetime.date(2019, 8, 28)), ('close', 205.53), ('volume', 15957630.0), ('open', 204.1), ('high', 205.72), ('low', 203.32)])
OrderedDict([('id', 6), ('date', datetime.date(2019, 8, 27)), ('close', 204.16), ('volume', 25897340.0), ('open', 207.86), ('high', 208.55), ('low', 203.53)])
OrderedDict([('id', 7), ('date', datetime.date(2019

In [6]:
# Let's format it a bit
for i, row in enumerate(db['aapl']):
    if i < 10:
        print(f"{row['date']} {row['close']} {row['volume']} {row['open']} {row['high']} {row['low']}")

2019-09-04 209.19 19216820.0 208.39 209.48 207.32
2019-09-03 205.7 20059570.0 206.43 206.98 204.22
2019-08-30 208.74 21162560.0 210.16 210.45 207.2
2019-08-29 209.01 21007650.0 208.5 209.32 206.655
2019-08-28 205.53 15957630.0 204.1 205.72 203.32
2019-08-27 204.16 25897340.0 207.86 208.55 203.53
2019-08-26 206.49 26066130.0 205.86 207.19 205.0573
2019-08-23 202.64 46882840.0 209.43 212.051 201.0
2019-08-22 212.46 22267820.0 213.19 214.435 210.75
2019-08-21 212.64 21564750.0 212.99 213.65 211.6032


Query Deliverable 1:

1. Find all days where the stock closed lower than 25. 
    * Print a count of how many
    * Print the first 5 rows found

In [7]:
#Create a low_close query result that only returns rows with a close below 25.
low_close = db['aapl'].find(close = {'<': 25})
for row in low_close:
    vals = [v for k, v in row.items()]
    print(vals)

[417, datetime.date(2009, 9, 14), 24.8171, 80383404.0, 24.4043, 24.8428, 24.3214]
[418, datetime.date(2009, 9, 11), 24.5943, 87108026.0, 24.7014, 24.74, 24.41]
[419, datetime.date(2009, 9, 10), 24.6514, 122612107.0, 24.58, 24.75, 24.4014]
[420, datetime.date(2009, 9, 9), 24.4486, 202624511.0, 24.6828, 24.9243, 24.2428]
[421, datetime.date(2009, 9, 8), 24.7043, 78524974.0, 24.7114, 24.7343, 24.5714]
[422, datetime.date(2009, 9, 4), 24.33, 93309888.0, 23.8966, 24.3857, 23.87]


In [8]:
#Take the low_close values and collect them as rows for an Excel sheet.
low_close = db['aapl'].find(close = {'<': 25})
excel_rows = []

# Get the headers
# excel_rows.append(db['aapl'].columns[1:])

for row in low_close:
    vals = [v for k, v in row.items()]
    excel_rows.append(vals[1:])

#Print the first rows.
excel_rows[:5]

[[datetime.date(2009, 9, 14), 24.8171, 80383404.0, 24.4043, 24.8428, 24.3214],
 [datetime.date(2009, 9, 11), 24.5943, 87108026.0, 24.7014, 24.74, 24.41],
 [datetime.date(2009, 9, 10), 24.6514, 122612107.0, 24.58, 24.75, 24.4014],
 [datetime.date(2009, 9, 9), 24.4486, 202624511.0, 24.6828, 24.9243, 24.2428],
 [datetime.date(2009, 9, 8), 24.7043, 78524974.0, 24.7114, 24.7343, 24.5714]]
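
The deliverable also asks for a count. The query is run a second time above because, as far as I understand it, the result of find() is used up as it is iterated (treat that as my assumption about dataset's result object). Here is a minimal sketch of the pattern, with a plain generator named `fake_query` standing in for the query result: turn it into a list once, then take both the count and the first five rows from that list.

```python
def fake_query():
    """Stand-in for db['aapl'].find(...): yields rows one at a time."""
    closes = [24.8171, 24.5943, 24.6514, 24.4486, 24.7043, 24.33]
    for close in closes:
        yield {"close": close}


result = fake_query()
rows = list(result)      # consume the iterator once and keep the rows

print(f"Count: {len(rows)}")
for row in rows[:5]:     # first five rows
    print(row)

leftover = list(result)  # the iterator is now exhausted
print(f"Rows left in iterator: {len(leftover)}")
```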

In [9]:
workbook = Workbook() # Create a new workbook
sheet = workbook.active # Get the active worksheet

header = db['aapl'].columns[1:]
# The header row comes from the db as a weird type
# so we convert to string
header = [str(v) for v in header]
sheet.append(header)
for row in excel_rows:
    sheet.append(row)

In [10]:
#Save the Excel spreadsheet.
lfname = "lowclose_aapl.xlsx"

workbook.save(filename=lfname)

Query Deliverable 2:
2. Find all days in 2017 where the stock closed above 35.
    * Print a count of how many
    * Print the last 5 found.

In [11]:
#Load the datetime and parse utilities for the next query.
from datetime import datetime
from dateutil.parser import parse

In [12]:
#Filter for dates in 2017 and a close higher than 35.
#The date strings are passed to the between filter directly, so the parse utility is not needed here.
high_close = db['aapl'].find(close = {'>': 35}, date = {'between': ['2017-01-01', '2017-12-31']})

for i, row in enumerate(high_close):
    if i < 10:
        print(f"{row['date']} {row['close']} {row['volume']} {row['open']} {row['high']} {row['low']}")

2017-12-29 169.23 25938760.0 170.52 170.59 169.22
2017-12-28 171.08 16412270.0 171.0 171.85 170.48
2017-12-27 170.6 21477380.0 170.1 170.78 169.71
2017-12-26 170.57 33113340.0 170.8 171.47 169.679
2017-12-22 175.01 16339690.0 174.68 175.424 174.5
2017-12-21 175.01 20848660.0 174.17 176.02 174.1
2017-12-20 174.35 23451420.0 174.87 175.42 173.25
2017-12-19 174.54 27393660.0 175.03 175.39 174.09
2017-12-18 176.42 29385650.0 174.88 177.2 174.86
2017-12-15 173.97 40122100.0 173.63 174.17 172.46


In [13]:
#Take the high_close values and collect them as rows for an Excel sheet.
high_close = db['aapl'].find(close = {'>': 35}, date = {'between': ['2017-01-01', '2017-12-31']})
excel_rows = []

# Get the headers
# excel_rows.append(db['aapl'].columns[1:])

for row in high_close:
    vals = [v for k, v in row.items()]
    excel_rows.append(vals[1:])

excel_rows[:5]

[[datetime.date(2017, 12, 29), 169.23, 25938760.0, 170.52, 170.59, 169.22],
 [datetime.date(2017, 12, 28), 171.08, 16412270.0, 171.0, 171.85, 170.48],
 [datetime.date(2017, 12, 27), 170.6, 21477380.0, 170.1, 170.78, 169.71],
 [datetime.date(2017, 12, 26), 170.57, 33113340.0, 170.8, 171.47, 169.679],
 [datetime.date(2017, 12, 22), 175.01, 16339690.0, 174.68, 175.424, 174.5]]
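
The deliverable asks for a count and the last 5 rows found, so here is a hedged sketch of both using the standard-library sqlite3 module. It runs against an in-memory table filled with a few rows copied from the outputs above. Storing the dates as ISO text (YYYY-MM-DD) is an assumption of this sketch, and so is ordering by date before taking the last five.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for stock_prices.db
conn.execute("CREATE TABLE aapl (id INTEGER PRIMARY KEY, date TEXT, close REAL)")
conn.executemany(
    "INSERT INTO aapl (date, close) VALUES (?, ?)",
    [
        ("2019-09-04", 209.19),   # outside 2017
        ("2017-12-29", 169.23),
        ("2017-12-28", 171.08),
        ("2017-12-27", 170.6),
        ("2017-12-26", 170.57),
        ("2017-12-22", 175.01),
        ("2017-12-21", 175.01),
        ("2009-09-14", 24.8171),  # outside 2017 and below 35
    ],
)

query = """
    SELECT date, close FROM aapl
    WHERE close > 35 AND date BETWEEN '2017-01-01' AND '2017-12-31'
    ORDER BY date
"""
matches = conn.execute(query).fetchall()

print(f"Count: {len(matches)}")
for row in matches[-5:]:  # last five rows in date order
    print(row)
```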

In [14]:
workbook = Workbook() # Create a new workbook
sheet = workbook.active # Get the active worksheet

header = db['aapl'].columns[1:]
# The header row comes from the db as a weird type
# so we convert to string
header = [str(v) for v in header]
sheet.append(header)
for row in excel_rows:
    sheet.append(row)

In [15]:
#Save the Excel spreadsheet.
hfname = "2017highclose_aapl.xlsx"

workbook.save(filename=hfname)
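
The deliverable asks for each query result on its own worksheet in a single workbook, whereas the cells above save two separate files. Below is a minimal sketch of the single-workbook variant with openpyxl. The function name `save_results`, the sheet titles, the file name aapl_queries.xlsx, and the shortened sample rows are all hypothetical, and the file is written to a temporary folder so nothing in the assignment folder is overwritten.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


def save_results(path, header, results):
    """Write each (sheet title, rows) pair to its own worksheet in one workbook."""
    workbook = Workbook()
    first_sheet = workbook.active
    for position, (title, rows) in enumerate(results.items()):
        # Reuse the sheet that Workbook() creates, then add one per further result.
        sheet = first_sheet if position == 0 else workbook.create_sheet()
        sheet.title = title
        sheet.append(header)
        for row in rows:
            sheet.append(row)
    workbook.save(filename=path)


header = ["date", "close"]
results = {
    "low_close": [["2009-09-14", 24.8171], ["2009-09-11", 24.5943]],
    "high_close_2017": [["2017-12-29", 169.23], ["2017-12-28", 171.08]],
}

out_path = os.path.join(tempfile.mkdtemp(), "aapl_queries.xlsx")
save_results(out_path, header, results)

# Reload the file to confirm both worksheets were written.
sheet_names = load_workbook(out_path).sheetnames
print(sheet_names)
```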

In closing, I learned a lot. Keep It Simple, Stupid (KISS) is a challenge for this Stupid. Sorry if this assignment proved simple enough to make me overthink all my code. At least it's complete, and I learned enough to get me going next time.