# Detecting events in Timestamp data

Author: Titus de Jong (tydejong@uci.edu)

Course Project, UC Irvine, Math 10, Summer 2023

## Introduction

When walking through a student research section, I noticed a few posters that stuck out-- One covered how to get rid of all useless or unimportant data, and the other covered event detection using machine learning techniques. While they both go hand-in-hand, I figured that it would be more efficient to focus on event detection (the second project), allbeit a little simplified from the research I saw. Thus I tried to implement different Machine Learning techniques such as Polynomial Regressions and Decision Tree Regressions to try to find so-called 'events,' where some abnormality exists that impacted data. I then plotted results and analyzed them.

### Data Imported

While, in theory, these programs should run for for any dataset, I used publicly available Stock value sheets, notably the [Dow Jones Index](https://www.investing.com/indices/us-30-historical-data) fund, to verify results. With these data, I already knew that there would be events (such as the 2008 Housing Market Crisis, the 2020 Covid Crash, and the 2022 Ukraine war Crash), which made it easier to know that such an event happened, and was detected. 

## Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
import altair as alt
from sklearn.tree import DecisionTreeRegressor

## Pre-Processing Data

This code is more specialized for each data set, as it prepares the datasets I use to be processed.

In [2]:
df = pd.read_csv('Dow data.csv')
df = df.dropna()
df['Date'] = pd.to_datetime(df['Date'])
df['Price'] = list(map(lambda x: x.replace(',',''), df['Price']))
df['Price'] = pd.to_numeric(df['Price'])


raw_data = alt.Chart(df).mark_line().encode(
    x = 'Date:T',
    y = alt.Y('Price', axis=alt.Axis(tickCount=4)),
    tooltip = ['Date', 'Price']
    
).properties(
    width=200,
    height=200,
)
raw_data


## Functions and Implementation of Algorithms

### Basic Function( No Machine Learning)

I just found the top 5% and the bottom 5% of data, also adding in random data to  show a general shape of the curve.

In [3]:
def Basic(data, columny):

    basic_data = data.sort_values(by = columny, ascending = False)

    
    (height, width) = basic_data.shape
    Percent5 = height//20
    top = basic_data.head(Percent5).copy()
    top['event'] = 'Max data'
    bot = basic_data.tail(Percent5).copy()
    bot['event'] = 'Min data'
    ran = basic_data.sample(frac = .2).copy()
    ran['event'] = 'Random data'

    out_data = pd.concat([top, ran, bot], axis = 0)

    
    return out_data

### Polynomial Regression

Implementing a 3rd order polynomial to find a 'rolling average' of the data, then selecting the 5%. of data furthest away from the curve.

In [4]:
def Poly_regression(data, columnx, columny):

    poly_reg_data = data
    poly_reg_data['1st Order'] = list(range(poly_reg_data.shape[0]))
    poly_reg_data['1st Order'] = poly_reg_data['1st Order']+1
    poly_reg_data['2nd Order'] = list(map(lambda x: x**2, poly_reg_data['1st Order']))
    poly_reg_data['3rd Order'] = list(map(lambda x: x**3, poly_reg_data['1st Order']))

    pred_col = ['1st Order', '2nd Order', '3rd Order']
    clf = LinearRegression()
    clf.fit(poly_reg_data[pred_col], poly_reg_data[columny])
    
    predicted_out = clf.predict(poly_reg_data[pred_col])
    predicted_out = list(map(lambda x: float(x), predicted_out))
    dy_percent = abs((poly_reg_data[columny]-predicted_out))
    poly_reg_data['difference'] = dy_percent
    poly_reg_data = poly_reg_data.sort_values(by = 'difference', ascending = False)

    Percent5 = poly_reg_data.shape[0]//20
    top = poly_reg_data.head(Percent5).copy()
    top['event'] = 'Max data'
    ran = poly_reg_data.sample(frac = .2).copy()
    ran['event'] = 'Random data'

    out_data = pd.concat([top, ran])

    out_score = clf.score(poly_reg_data[pred_col], poly_reg_data[columny])
    return out_data, out_score

### Decision Tree Regression

Implementing a Decision Tree Regression from Sci-kit learn, then finding any outlying data.

In [5]:
def Decision_Tree(data, columnx, columny, depth = 5):

    Decision_data = data
    input_data = np.array(Decision_data['Date'])
    input_data = np.reshape(input_data, (-1, 1))

    fit1 = DecisionTreeRegressor(max_depth = depth)
    fit1.fit(input_data, Decision_data[columny])
    Decision_data['pred_y'] = fit1.predict(input_data)

    predicted_out = list(map(lambda x: float(x), Decision_data['pred_y']))
    dy_percent = abs((Decision_data[columny]-Decision_data['pred_y']))
    Decision_data['difference'] = dy_percent
    Decision_data = Decision_data.sort_values(by = 'difference', ascending = False)

    Percent5 = Decision_data.shape[0]//20
    top = Decision_data.head(Percent5).copy()
    top['event'] = 'Max data'
    bot = Decision_data.tail(Percent5).copy()
    bot['event'] = 'Min data'
    ran = Decision_data.sample(frac = .2).copy()
    ran['event'] = 'Random data'

    out_data = pd.concat([top, ran, bot])
    out_score = fit1.score(input_data, Decision_data[columny])
    return out_data, out_score

## Results

### Using the functions

In [6]:
Basic1 = Basic(df, columny = 'Price')
(Poly1, Poly_score) = Poly_regression(df, columnx = 'Date', columny = 'Price')
(Decision1, Decision_score) = Decision_Tree(df, columnx = 'Date', columny = 'Price', depth = 5)

### Scores

In [7]:
print(Poly_score)
print(Decision_score)

## Poly_score is not overfit, and Decision_score is not significant

0.959506698371601
-0.8728662548391584


### Output Graphs

In [8]:
basic_chart = alt.Chart(Basic1).mark_line().encode(
    x = 'Date:T',
    y = 'Price',
    color = alt.Color('event'),

).properties(
    title = 'Basic processing (Without Machine learning)',
    width = 200,
    height = 200
)

poly_chart = alt.Chart(Poly1).mark_line().encode(
    x = 'Date:T',
    y = 'Price',
    color = alt.Color('event'),
    
).properties(
    title = 'With a Polynomial Regression',
    width = 200,
    height = 200
)

decision_chart = alt.Chart(Decision1).mark_line().encode(
    x = 'Date:T',
    y = 'Price',
    color = alt.Color('event'),

).properties(
    title = 'With a Decision Tree',
    width = 200,
    height = 200
)

(basic_chart & decision_chart)| poly_chart 

While the above cells show a rough representation of the data, the following graphs will show how it relates to the original imported data.

In [9]:

basic_chart_overlay = alt.Chart(Basic1[Basic1['event'].isin(['Max data', 'Min data'])]).mark_circle(opacity = 1).encode(
    x = 'Date:T',
    y = 'Price',
    color = alt.Color('event'),
    tooltip = ['Date', 'Price']
).properties(
    title = 'Basic processing (Without Machine learning)',
    width = 200,
    height = 200
)

poly_chart_overlay = alt.Chart(Poly1[Poly1['event']== 'Max data']).mark_circle(opacity = 1).encode(
    x = 'Date:T',
    y = 'Price',
    color = alt.Color('event'),
    tooltip = ['Date', 'Price']
).properties(
    title = 'With a Polynomial Regression',
    width = 200,
    height = 200
)

decision_chart_overlay = alt.Chart(Decision1[Decision1['event']=='Max data']).mark_circle(opacity = 1).encode(

    x = 'Date:T',
    y = 'Price',
    color = alt.Color('event'),
    tooltip = ['Date', 'Price']
).properties(
    title = 'With a Decision Tree',
    width = 200,
    height = 200
)

raw_data_overlay = alt.Chart(df).mark_line(color = 'red', opacity = .2).encode(
    x = 'Date:T',
    y = 'Price',
).properties(
    width = 200,
    height = 200
)

graph_basic = basic_chart_overlay + raw_data_overlay
graph_poly = poly_chart_overlay+raw_data_overlay
graph_decision = decision_chart_overlay+raw_data_overlay

(graph_basic & graph_decision) | graph_poly


As is shown, without any processing, the basic algorithm may capture the mins and maxes, but can not exptrapolate whether an event actually occured. For all the user knows, the Dow Jones would have stayed at the same price between 2003 and 2009. However, when looking at both the polynomial regression and the decision tree algorithms, the user gains a better understanding that there were irregularites. 

## Interpreting the results

## Introduction

Now that the irregularities have been singled out, I wondered whether it would be possible to Highlight when the irregularities are, and single out a handful of datapoints where that is shown. Rather, it automatically interprets the data for us. Immediately I wanted to try a clustering algorithm, where each cluster represents a different 'percieved' event. This immediately backfired for multiple reasons, so I made my own code to comb through the different datavalues from above.

### Data Used

This Algorithm uses the data already produced in the previous code. However, I will modify the data to seperate the data. After the processing above, the code produces points that it thinks are events and then also collects data that is random to give a vague gauge of the function, thus I will seperatre out the event points. I will label them as:

Basic_cleaned = Basic1
Poly_cleaned = Poly1
Decision_cleaned = Decision1



In [10]:
Basic_cleaned = Basic1[Basic1['event'] != 'Random data']
Poly_cleaned = Poly1[Poly1['event'] != 'Random data']
Decision_cleaned = Decision1[Decision1['event'] != 'Random data']

## Importing Libraries

In [11]:
import pandas as pd
import numpy as np
import altair as alt

## Functions and Implementation of Algorithms

### Basic Function and Calling

In [12]:
def event_detect(data, sel_cols = ['Date', 'Price']):
    event_data = data
    event_data = event_data.sort_values(by = 'Date')

    list1 = event_data['Price'].tolist()[1::]
    list2 = event_data['Price'].tolist()[:-1:]
    event_data['slope'] = [0]+ [list1[i]-list2[i] for i in range(len(list1))]

    event_data = event_data.sort_values(by = ['slope'], ascending = False)
    Percent20 = event_data.shape[0]//5
    top_20 = event_data.head(Percent20).copy()

    return top_20

events_Basic = event_detect(Basic_cleaned)
events_Poly = event_detect(Poly_cleaned)
events_Decision = event_detect(Decision_cleaned)



## Results

### Basic Plots

In [13]:
basic_chart_event = alt.Chart(events_Basic).mark_circle(opacity = 1, color = 'black').encode(
    x = 'Date:T',
    y = 'Price',
    tooltip = ['Date', 'Price']
).properties(
    title = 'Basic processing (Without Machine learning)',
    width = 200,
    height = 200
)

poly_chart_event = alt.Chart(events_Poly).mark_circle(opacity = 1, color = 'black').encode(
    x = 'Date:T',
    y = 'Price',
    tooltip = ['Date', 'Price']
).properties(
    title = 'With a Polynomial Regression',
    width = 200,
    height = 200
)

decision_chart_event = alt.Chart(events_Decision).mark_circle(opacity = 1, color = 'black').encode(

    x = 'Date:T',
    y = 'Price',
    tooltip = ['Date', 'Price']
).properties(
    title = 'With a Decision Tree',
    width = 200,
    height = 200
)



### Plots in Reference to Data

In [14]:
basic_events = basic_chart_event + raw_data_overlay
poly_events = poly_chart_event + raw_data_overlay
decision_events = decision_chart_event+raw_data_overlay

(basic_events & decision_events) | poly_events

In the above graphs, it shows a handful of points that the machine thinks is significant towards the data. These points denote that an event occured and show when, with what relevant prices. The basic processing and the polynomial regression showed almost none of these events, or at least the events I consider the most significant. The Decision Tree, on the other hand, showed that there was an event in 2007 before the market crashed, there was an event in late 2019, when the Covid Crash started, and that the economy was increasing after again. Thus it seems that out of the different methods tested, the Decision Tree had the highest accuracy.

## Conclusion

## Part 1: Detecting Events

While a larger assortment of method could have been used, I tried to contrain myself to these three to see not only if it was possible, but rather the different pros and cons of different algorithms. I picked the decision tree regression algorithm as I thought its ability to curve fit would be great for this project, the polynomial regression because of how it follows the center of data and allows for more outliers, and the basic processing which was just for reference. When planning the process, I realized that the project wouldn't have to worry too much about overfitting, unless a curve was overfitted and no data points could be extracted. For the regression, for example, I limited the regression possibilities to a 3rd order polynomial to prevent overfitting and to allow those outlying values, and as seen in the decision tree, I had to use different values of depth to make sure that the tree was underfitted, so that the outlying data could be collected. Regardless, I took special steps to prevent overfitting the data. This project also did not necessarily require test data. While yes, a test set is important, this program dealt much more with classification rather than regression. That means that the code 'classified' each of the marked-dots as outliers compared to the rest of the data. With that in mind, the program produced great results when looking at both the polynomial regression and the decision tree. As seen in the graphs above, all of the main points/biggest outliers were selected for both the polynomial regression and the decision tree regression, so the code worked well.

## Part 2: Interpreting Data

I used a hand-made code to find the difference in price between nearest neighbors, to find the most significant events using the data produced in Part 1. The data produced was more subpar than what was expected. While, yes, the decision tree algorithm was able to correctly see 2/3 of the big events I was searching for, the fact that it missed one and that the polynomial regression processing, did mean that my code was faulty. I believe that more can be done to better select the points that are most significant, such as using a new algorithm that takes relative difference rather than actual distance. I did, however, try to use k-means and spectral clustering algorithms, but no matter how I tweaked them, I never got data that I was satisfied with, leading to the code I made above.

## Overall

The code was mostly success in being able to determine which data. values in a given subset were outliers from the rest of the data set. Future tweaks can be implemented to further refine the decisions of the machine.

## References

For my data: I used the [Dow Jones index fund](https://www.investing.com/indices/us-30-historical-data).

Helpful References:
[Sci-Kit learn-Official Documentation](https://scikit-learn.org/stable/)
[Pandas Official Documentation](https://pandas.pydata.org/docs/)
[Altair Official Documentation](https://altair-viz.github.io/getting_started/overview.html)
[Numpy Official Documentation](https://numpy.org/doc/)
[Stack Overflow](https://stackoverflow.com/)
Math 10 Course Notes

__Because of an NDA, I am not able to source the original inspiration for my project__

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=f403051e-7163-4fbc-8e26-5d957540ed70' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>