## Time series analysis
This notebook first uses [RedMed](https://github.com/alavertu/redmed) [1] to detect opioid-related terms in Reddit comments. It then builds a monthly time series of mentions for each substance and, once these series are constructed, analyzes them with various mathematical tools.

In [1]:
import pandas as pd
from psaw import PushshiftAPI
import datetime as dt
import gc 
import redmed
import numpy as np

In [2]:
#defining red-med model
tagger = redmed.redmedTagger()

In [3]:
drugs = ['heroin', 'morphine', 'fentanyl', 'oxycodone']

## Reading all months' data
This section iterates over all the files fetched by the previous notebook, one CSV file per month. For each file it sums the RedMed mention counts for each substance; these counts are normalized in the next section.

In [4]:
# import required module
import os
 
# assign directory
directory = 'reddit_data'

#arrays 
months = []
n_whole = []
n_whole_repeat = []
n_heroin = []
n_morphine = []
n_fentanyl = []
n_oxycodone = []

 
# read every monthly CSV and tally RedMed mentions
for entry in os.scandir(directory):
    # skip sub-directories and anything that is not a CSV file
    if not entry.is_file() or not entry.name.endswith('.csv'):
        continue
    print(entry)
    df = pd.read_csv(entry.path)
    # month label from the file name, e.g. '2019-1.csv' -> '2019-1'
    months.append(os.path.splitext(entry.name)[0])
    # number of comments in this month's file
    n_whole.append(len(df))
    # per-substance mention counts for this month
    n_heroin_ = 0
    n_morphine_ = 0
    n_fentanyl_ = 0
    n_oxycodone_ = 0
    # cast to str so that missing bodies (NaN) do not break the tagger
    all_txts = [str(txt) for txt in df['body'].values.tolist()]
    for main_str in all_txts:
        # use RedMed to detect opioid-related drugs; tag each comment only once
        try:
            counts = tagger.get_mention_counts(main_str)
        except Exception:
            # the tagger can fail on some comments (e.g. non-English text); skip them
            continue
        if 'heroin' in counts:
            n_heroin_ += counts['heroin']
        if 'morphine' in counts:
            n_morphine_ += counts['morphine']
        if 'fentanyl' in counts:
            n_fentanyl_ += counts['fentanyl']
        if 'oxycodone' in counts:
            n_oxycodone_ += counts['oxycodone']
    # total mentions of the four substances in this month
    n_whole_repeat.append(n_oxycodone_+n_fentanyl_+n_morphine_+n_heroin_)
    n_heroin.append(n_heroin_)
    n_morphine.append(n_morphine_)
    n_fentanyl.append(n_fentanyl_)
    n_oxycodone.append(n_oxycodone_)

<DirEntry '2019-1.csv'>
<DirEntry '2020-5.csv'>
<DirEntry '2020-4.csv'>
<DirEntry '2019-2.csv'>
<DirEntry '2020-6.csv'>
<DirEntry '2020-7.csv'>
<DirEntry '2019-3.csv'>
<DirEntry '2020-10.csv'>
<DirEntry '2019-7.csv'>
<DirEntry '2020-3.csv'>
<DirEntry '2020-2.csv'>
<DirEntry '2019-6.csv'>
<DirEntry '2020-11.csv'>
<DirEntry '2019-4.csv'>
<DirEntry '2020-1.csv'>
<DirEntry '2019-5.csv'>
<DirEntry '2020-12.csv'>
<DirEntry '2021-6.csv'>
<DirEntry '2021-7.csv'>
<DirEntry '2021-5.csv'>
<DirEntry '2021-4.csv'>
<DirEntry '2021-1.csv'>
<DirEntry '2021-3.csv'>
<DirEntry '2021-2.csv'>
<DirEntry '2019-10.csv'>
<DirEntry '2019-11.csv'>
<DirEntry '2019-12.csv'>
<DirEntry '2021-9.csv'>
<DirEntry '2021-12.csv'>
<DirEntry '2021-8.csv'>
<DirEntry '2021-11.csv'>
<DirEntry '2021-10.csv'>
<DirEntry '2019-8.csv'>
<DirEntry '2019-9.csv'>
<DirEntry '2020-9.csv'>
<DirEntry '2020-8.csv'>
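
The tallying step in the loop above can also be written as a small helper that tags each comment once and accumulates the counts. This is only a sketch: `tally_mentions` and `fake_counts` are hypothetical names, and `get_counts` stands in for `tagger.get_mention_counts`, which is assumed (from how the loop uses it) to return a mapping from drug name to mention count.

```python
from collections import Counter

def tally_mentions(texts, get_counts, drugs):
    """Sum per-drug mention counts over an iterable of comment texts."""
    totals = Counter({drug: 0 for drug in drugs})
    for text in texts:
        try:
            counts = get_counts(str(text))  # tag each comment only once
        except Exception:
            continue  # skip comments the tagger cannot process
        for drug in drugs:
            if drug in counts:
                totals[drug] += counts[drug]
    return dict(totals)

# Stand-in for the RedMed tagger (hypothetical): counts lowercase words.
def fake_counts(text):
    return Counter(text.lower().split())

drugs = ['heroin', 'morphine', 'fentanyl', 'oxycodone']
result = tally_mentions(['Heroin and fentanyl', 'fentanyl again', float('nan')],
                        fake_counts, drugs)
print(result)
```

Passing the tagger call in as an argument keeps the helper testable without loading RedMed.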


## Constructing the main time series

In [5]:
# normalizing: each substance's share of all mentions counted in that month
n_fentanyl_nr = [n_fentanyl[i]/n_whole_repeat[i] for i in range(len(n_fentanyl))]
n_heroin_nr = [n_heroin[i]/n_whole_repeat[i] for i in range(len(n_heroin))]
n_morphine_nr = [n_morphine[i]/n_whole_repeat[i] for i in range(len(n_morphine))]
n_oxycodone_nr = [n_oxycodone[i]/n_whole_repeat[i] for i in range(len(n_oxycodone))]
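
With plain Python integers, the division above raises `ZeroDivisionError` for any month in which none of the four substances is mentioned. A minimal sketch of a guarded version follows; `normalize_shares` is a hypothetical helper, the counts are made up, and returning 0.0 for an empty month is an assumption rather than something this notebook specifies.

```python
def normalize_shares(counts_by_drug):
    """Convert per-month counts into per-month shares of the monthly total."""
    series = list(counts_by_drug.values())
    # total mentions per month across all drugs
    totals = [sum(month) for month in zip(*series)]
    return {
        drug: [c / t if t else 0.0 for c, t in zip(counts, totals)]
        for drug, counts in counts_by_drug.items()
    }

# Made-up counts for three months; the last month has no mentions at all.
example = {'heroin': [3, 1, 0], 'fentanyl': [1, 1, 0]}
shares = normalize_shares(example)
print(shares)
```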

In [6]:
months = pd.DatetimeIndex(months)
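
The month labels come from file names such as `2019-1.csv`, so they look like `2019-1`. Where relying on implicit date parsing is a concern, the labels can be parsed with an explicit format. This sketch uses only the standard library; `parse_month` is a hypothetical helper name.

```python
from datetime import datetime

def parse_month(label):
    """Parse a 'YYYY-M' label into the first day of that month."""
    return datetime.strptime(label, '%Y-%m')

labels = ['2019-1', '2020-10', '2019-12']
parsed = sorted(parse_month(label) for label in labels)
print([d.strftime('%Y-%m-%d') for d in parsed])
```

The resulting `datetime` objects can be passed to `pd.DatetimeIndex` in place of the raw strings.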

In [7]:
#storing time series 
time_series = pd.DataFrame(columns=['date','fentanyl','heroin','morphine','oxycodone'])
time_series['date'] = pd.DatetimeIndex(months)
time_series['oxycodone'] = n_oxycodone
time_series['morphine'] = n_morphine
time_series['heroin'] = n_heroin
time_series['fentanyl'] = n_fentanyl

In [8]:
# storing time series (normalized)
time_series_nr = pd.DataFrame(columns=['date','fentanyl','heroin','morphine','oxycodone'])
time_series_nr['date'] = pd.DatetimeIndex(months)
time_series_nr['oxycodone'] = n_oxycodone_nr
time_series_nr['morphine'] = n_morphine_nr
time_series_nr['heroin'] = n_heroin_nr
time_series_nr['fentanyl'] = n_fentanyl_nr

In [9]:
#sorting by date 
time_series.sort_values('date',inplace=True)
time_series_nr.sort_values('date',inplace=True)

## Visualization

In [12]:
import plotly.express as px
fig = px.line(time_series, x="date", y=time_series.columns,
              hover_data={"date": "|%B %d, %Y"}
              )
fig.update_xaxes(
    dtick="M1",
    tickformat="%b\n%Y")
fig.update_layout(
    title='time series',
    yaxis_title="# comments",
    legend_title="Category",

)
fig.write_html('n_comments-timeseries.html')
fig.show()

In [13]:
import plotly.express as px
fig = px.line(time_series_nr, x="date", y=time_series_nr.columns,
              hover_data={"date": "|%B %d, %Y"},
              title='normalized time series')
fig.update_layout(
    yaxis_title="# comments (normalized)",
    legend_title="Category",
)
fig.write_html('n_comments(normlized)-timeseries.html')
fig.show()

## To do: 
1. **EDA**
2. **time series analysis (dynamic behavior of the time series using Takens' embedding theorem)**
3. **anomaly analysis**
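
For item 2, Takens' theorem is usually applied through a time-delay embedding, in which each point of the series is replaced by a vector of lagged values. Below is a minimal pure-Python sketch, assuming a scalar series. `delay_embed`, `dim` and `tau` are illustrative names, and the values are made up rather than taken from this notebook.

```python
def delay_embed(series, dim, tau):
    """Return the delay vectors (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series is too short for this dim and tau")
    return [tuple(series[i + k * tau] for k in range(dim))
            for i in range(n_vectors)]

# Toy series, embedded in 3 dimensions with a delay of 2 steps.
toy = [0.1, 0.4, 0.2, 0.5, 0.3, 0.6]
vectors = delay_embed(toy, dim=3, tau=2)
print(vectors)
```

With only 36 monthly points per substance, the embedding dimension and delay have to stay small for enough delay vectors to remain.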
