# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 3 Assignment: Data Preparation in Pandas**

**Victor Macia**

# Assignment Instructions

For this assignment, you will use the **series-31** dataset.  This file contains a dataset that I generated explicitly for this semester.  You can find the CSV file on my data site, at this location: [series-31](https://data.heatonresearch.com/data/t81-558/datasets/series-31.csv). Load and summarize the data set.  You will submit this summarized dataset to the **submit** function.  See [Assignment #1](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb) for details on how to submit an assignment or check that one was submitted.

The RAW datafile looks something like the following:


|time|value|
|----|-----|
|8/22/19 12:51|19.19535862|
|9/19/19 9:44|13.51954348|
|8/26/19 14:05|9.191413297|
|8/19/19 16:37|18.34659762|
|9/5/19 9:18|1.349778007|
|9/2/19 10:23|8.462216832|
|8/23/19 15:05|17.2471252|
|...|...|

Summarize the dataset as follows:

|date|starting|max|min|ending|
|---|---|---|---|---|
|8/19/19|17.57352208|18.46883497|17.57352208|18.46883497|
|8/20/19|19.49660945|19.84883044|19.49660945|19.84883044|
|8/21/19|20.0339169|20.0339169|19.92099707|19.92099707|
|...|...|...|...|...|

* There should be one row for each unique date in the data set.
* Think of the **value** as a stock price.  You only have values during certain hours and certain days.
* The **date** column is each of the different dates in the file.
* The **starting** column is the first **value** of that date (has the earliest time).
* The **max** column is the maximum **value** for that day.
* The **min** column is the minimum **value** for that day.
* The **ending** column is the final **value** for that day (has the latest time).
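
The rules above can be sketched on a tiny example. This is a minimal sketch, not the reference solution: the `toy` frame and its values are made up for illustration (they are not taken from series-31), and the timestamp format string is an assumption based on the raw-data excerpt.

```python
import pandas as pd

# Made-up toy data (NOT from series-31): two days, rows deliberately unsorted.
toy = pd.DataFrame({
    "time": ["8/20/19 9:30", "8/19/19 16:00", "8/19/19 9:15",
             "8/19/19 12:00", "8/20/19 15:45"],
    "value": [5.0, 2.0, 1.0, 3.0, 4.0],
})

# Parse the strings so that sorting is chronological rather than alphabetical.
# The format string is an assumption based on the raw-data excerpt.
toy["time"] = pd.to_datetime(toy["time"], format="%m/%d/%y %H:%M")
toy = toy.sort_values("time")

# One row per calendar date; "first" and "last" rely on the sort above.
summary = (
    toy.groupby(toy["time"].dt.date)["value"]
       .agg(starting="first", max="max", min="min", ending="last")
       .rename_axis("date")
       .reset_index()
)
print(summary)
```

Sorting before grouping is what makes `first` and `last` mean "earliest" and "latest" within each day.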

You can process the **time** column either as strings or as Python **datetime**.  It may be necessary to use Pandas functions beyond those given in the class lecture.
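
One practical difference between the two approaches is sort order. The sketch below uses hypothetical timestamps in the same layout as the raw file (the format string is an assumption) to show that string sorting is not chronological when hours are not zero-padded.

```python
import pandas as pd

# Hypothetical timestamps in the same "M/D/YY H:MM" layout as the raw file.
stamps = pd.Series(["8/19/19 9:18", "8/19/19 16:37", "8/19/19 10:23"])

# As strings, comparison is character by character, so "10:23" and "16:37"
# sort before "9:18".
as_strings = stamps.sort_values().tolist()

# As datetimes, the order is chronological.
parsed = pd.to_datetime(stamps, format="%m/%d/%y %H:%M")
as_datetimes = parsed.sort_values().dt.strftime("%H:%M").tolist()

print(as_strings)
print(as_datetimes)
```

If the **time** column is kept as strings, the first and last rows of a day after sorting are therefore not guaranteed to be the earliest and latest readings.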

Note, you might get the following warning on the date field from the API.  You can safely ignore this warning:

* Warning: The mean of column date differs from the solution file by 2010.4. (might not matter if small)

Your submission triggers this warning due to the method you use to convert the time/date.  Your code is correct, whether you get this warning or not.

In [1]:

# My solution

# Project 3  Deep neural networks - Washington University in St. Louis 2020

import datetime  # only needed for the commented-out date filters below

import pandas as pd


# Load the data set once; df2 keeps an untouched copy for the first/last step

url = "https://data.heatonresearch.com/data/t81-558/datasets/series-31.csv"
df = pd.read_csv(url)
df2 = df.copy()
df['time'] = pd.to_datetime(df['time'], errors='coerce')

# Sort by time and reset the index

df = df.sort_values('time').reset_index(drop = True)

# Find max and min for each date (the resulting frame is indexed by date)

daily = df.groupby(df['time'].dt.date).value
df = pd.DataFrame({'max': daily.max(), 'min': daily.min()})

#  Notes on working with times
#
#  pd.to_datetime converts a column of strings into datetimes. After that,
#  the .dt accessor exposes the parts: dt.date, dt.hour, dt.month, dt.second, etc.
#  Each of these returns a pandas Series, not plain floats.
#
#  To get the underlying array instead of a Series: df.time.dt.second.array
#
#  Searching for values
#
#  df2.time[df2.time.dt.year == 2018]   # times in the year 2018
#
#  Several conditions at the same time (each one in parentheses):
#
#  df[(df['column1'] == value) | (df['column2'] == 'b') | (df['column3'] == 'c')]
#
#  Use the operators & (and) and | (or), not the Python keywords 'and' / 'or'.
#
#  Dates only, without the time of day: df2.time.dt.date
#
#  Years only: df2.time.dt.year
#
#  Times recorded on one particular day:
#  df2[df2.time.dt.date == datetime.date(2019, 9, 19)].time.dt.time
#
#  Times recorded in August 2019:
#  df2.time[(df2.time.dt.year == 2019) & (df2.time.dt.month == 8)]

# Finding first and last value for each day.

# Idea: convert to datetime first (sorting the raw strings is alphabetical,
# not chronological), then sort, reindex, and take the first and last value
# for each day

df2.time = pd.to_datetime(df2.time)

df2 = df2.sort_values(by = 'time')
df2 = df2.reset_index(drop=True)

unicos = sorted(list(set(df2.time.dt.date)))


first = len(unicos)*[0]
last = len(unicos)*[0]

# Extracting first and last value for each date

for i, day in enumerate(unicos):
    # Values recorded on this date, in the order df2 is currently sorted
    day_values = df2.loc[df2.time.dt.date == day, 'value']
    first[i] = day_values.iloc[0]
    last[i] = day_values.iloc[-1]

df['ending'] = last
df['starting'] = first

df['date'] = df.index
df = df[['date', 'ending', 'max', 'min', 'starting']].reset_index(drop = True)
df

| |date|ending|max|min|starting|
|---|---|---|---|---|---|
|0|2019-08-19|18.468835|18.468835|17.573522|17.573522|
|1|2019-08-20|19.84883|19.84883|19.496609|19.496609|
|2|2019-08-21|19.920997|20.033917|19.920997|20.033917|
|3|2019-08-22|18.89359|19.3937|18.89359|19.3937|
|4|2019-08-23|16.974865|17.784213|16.974865|17.784213|
|5|2019-08-26|8.814028|9.717627|8.814028|9.717627|
|6|2019-08-27|7.175677|7.630234|7.175677|7.630234|
|7|2019-08-28|7.091136|7.091136|6.940731|6.940731|
|8|2019-08-29|8.150454|8.150454|7.677676|7.677676|
|9|2019-08-30|9.441207|9.441207|8.989132|8.989132|
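
The output above lists the columns as date, ending, max, min, starting, while the example table in the instructions uses date, starting, max, min, ending. Whether the **submit** function depends on column order is not stated here; if matching the example layout is wanted, a minimal sketch of the reordering looks like this. The `result` frame is a stand-in built from the first two output rows, not the notebook's actual `df`.

```python
import pandas as pd

# Stand-in frame: the first two rows of the output above, in the order
# the notebook produced them.
result = pd.DataFrame({
    "date": ["2019-08-19", "2019-08-20"],
    "ending": [18.468835, 19.84883],
    "max": [18.468835, 19.84883],
    "min": [17.573522, 19.496609],
    "starting": [17.573522, 19.496609],
})

# Select the columns in the order used by the assignment's example table.
result = result[["date", "starting", "max", "min", "ending"]]
print(result)
```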
