####################################################
Documentation of the
Sigmoid Event Detection Algorithm
Mathew Teoh and Chandan Yeshwanth
July 2015
https://github.com/teoh/SigmoidEventDetection
####################################################
Thank you for using the Sigmoid Event Detection Algorithm.
####################################################
----- Sections -----
1. Requirements
2. Setup
3. Overview of the Algorithm
4. Details on how to use the Algorithm
5. Known Issues / Work for the Future
####################################################
----- 1. Requirements -----
To use the sigmoid event detection, make sure you have the following installed:
- MySQL Community Server (server version 5.6.25 or newer should be sufficient)
- Python (version 2.7.6 or newer should be sufficient)
- R (version 3.2.1 or newer should be sufficient)
The following are optional:
- RStudio (version 0.99.447 or newer should be sufficient)*
####################################################
----- 2. Setup -----
After having installed the above requirements, set up a database with the name "twitter". Run "Setup/create.sql" and then "Setup/alter.sql" in that order. This will create the table where all of the English tweets will be stored.
####################################################
----- 3. Overview of the Algorithm -----
The algorithm has three main steps, each with its own substeps:
I. Preprocessing
    i. Retrieving English tweets from the selected time period
    ii. Tokenising tweets, removing stop words
    iii. Getting IDF time series for each word
    iv. Removing words with small standard deviation in IDF
II. Transformation to 4-tuple
    i. For each word / IDF series, fit a sigmoid curve with tuple of parameters (a1,a2,a3,a4)
III. Get event keywords and related words
    i. Classify tuple as keyword / non-keyword
    ii. Get co-occurring keywords
    iii. For each keyword, group it with its co-occurrences to form a detected event
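As a rough illustration of steps III.ii and III.iii, the sketch below groups each keyword with the other keywords it appears alongside. It is not the code from this repository: the function name "group_events", the rule that two keywords co-occur when they appear in the same tweet, and the "min_count" threshold are all assumptions made for the example. It is written for Python 3.

```python
from collections import Counter, defaultdict


def group_events(tweets, keywords, min_count=2):
    """Group each keyword with the keywords it co-occurs with.

    tweets:    iterable of token lists (one list per tweet)
    keywords:  set of words already classified as event keywords
    min_count: hypothetical threshold, not taken from the repository
    """
    pair_counts = defaultdict(Counter)
    for tokens in tweets:
        # Keywords present in this tweet, each counted once per tweet.
        present = sorted(set(tokens) & keywords)
        for word in present:
            for other in present:
                if other != word:
                    pair_counts[word][other] += 1

    events = {}
    for word in sorted(keywords):
        related = sorted(
            other for other, n in pair_counts[word].items() if n >= min_count
        )
        events[word] = related
    return events


# Toy data, invented for this example.
tweets = [
    ["zayn", "leaves", "1d"],
    ["zayn", "1d", "news"],
    ["1d", "tour"],
    ["weather", "zayn"],
]
events = group_events(tweets, {"zayn", "1d", "tour"})
print(events)
```

Here "zayn" and "1d" appear together in two tweets and are grouped with each other, while "tour" appears with "1d" only once and so forms an event on its own.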
####################################################
----- 4. Details on how to use the Algorithm -----
When executing these scripts, make sure that you are in the topmost directory, that is, the one called "SigmoidEventDetection".
The steps in this section closely follow the algorithm overview in the previous section. The directories in SigmoidEventDetection contain either input/output data or the scripts that produce these data. The numbering of these directories loosely follows the numbering of the algorithm overview as well.
I.i. Getting the English tweets
- Niagarino has a class that reads files of Twitter data from the storage and extracts the English tweets. In order for the English tweets to be in the proper format, use the included file "I_i_RetrieveEnglTweets/Engl_TweetWriter.java" to replace the "EnglishTweetWriter.java" usually used in Niagarino.
- When retrieving English tweets, you will most likely be interested in tweets that span a certain time interval. The comments at the top of the main method in "Engl_TweetWriter.java" describe how to set these time intervals.
- Run the "main" method in that class to generate the output for the next step. An example of such output can be found in "I_i_EnglTweet/2015_06_01_02_en.csv". For legal reasons, this file is not available on the GitHub repo.
- Note that the files in the Twitter storage are named according to CET, but the times written in the files are in GMT.
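Because the file names are in CET and the timestamps are in GMT, it is easy to get the times for the next step wrong by an hour. The sketch below derives the three-line input used in step I.ii-iv from a file name. It assumes that the name gives the starting hour in CET, that CET is a fixed one hour ahead of GMT (no daylight saving), and that each file covers one hour. These assumptions were chosen only because they reproduce the example in this documentation, so check them against your own data. The function name "input_lines" is hypothetical. It is written for Python 3.

```python
from datetime import datetime, timedelta

# Assumption: CET is treated as a fixed UTC+1 offset (no daylight saving).
CET_OFFSET = timedelta(hours=1)


def input_lines(period_name, hours=1):
    """Return the three input lines for a period named like 2015_06_01_02."""
    start_cet = datetime.strptime(period_name, "%Y_%m_%d_%H")
    start_gmt = start_cet - CET_OFFSET
    end_gmt = start_gmt + timedelta(hours=hours)
    fmt = "%Y-%m-%d %H:%M"
    return [period_name, start_gmt.strftime(fmt), end_gmt.strftime(fmt)]


lines = input_lines("2015_06_01_02")
print("\n".join(lines))
```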
I.ii-iv. Tokenising, removing stop words, computing IDF, removing small stddev
- Loading the .csv files into MySQL (if you have already done this for your .csv files, you can skip this):
    - Change the working directory to the one which contains the files output by step I.i.
    - Run "I_ii-iv_TokeniseAndIDF/load.sh". You may need to change the MySQL username and password arguments in that script.
    - Now your English tweets are loaded into MySQL.
    - Change the working directory back to "SigmoidEventDetection/".
- Input file:
    - The file "I_ii-iv_TokeniseAndIDF/input" contains the input to "I_ii-iv_TokeniseAndIDF/main.py". Following the example data included, it may look like this:
        2015_06_01_02
        2015-06-01 01:00
        2015-06-01 02:00
    - The first line gives the event name (if you are testing the algorithm) or the general name of the time period (see the description in "Engl_TweetWriter.java"). In the example above, note that the time in the first line is in CET.
    - The second and third lines are the start and end times (respectively) of the tweets, in GMT.
    - Change these lines to match the dates used in I.i, adjusting the last two lines for GMT.
- Optional: Run "select count(*) from tweet_en where creation_date between <start_date> and <end_date> group by hour(creation_date), minute(creation_date);" in MySQL (e.g. <start_date> would be in the form of '2015-06-01 01:00', including quotes). MySQL should report a number of rows close to the number of minutes spanning your desired time interval.
- Making sure that your working directory is "SigmoidEventDetection/", run: "./I_ii-iv_TokeniseAndIDF/main.py < ./I_ii-iv_TokeniseAndIDF/input"
- Details on how main.py and the other scripts in this step work can be found in the comments of these files
- The output of this script will be a .csv file with a header, where each row is a word, followed by its IDF values for each minute in the specified time interval.
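To make the per-minute IDF series and the standard deviation filter more concrete, here is a minimal sketch. It is not the code in "main.py": the smoothed formula log((1 + N) / (1 + df)), the function names "idf_series" and "drop_flat_words", and the threshold "min_std" are assumptions made for illustration, and the actual script may define IDF and the cut-off differently. It is written for Python 3.

```python
import math
from statistics import pstdev


def idf_series(tweets_by_minute, vocabulary):
    """Return {word: [IDF value for each minute]}.

    Uses a smoothed IDF, log((1 + N) / (1 + df)), where N is the number
    of tweets in the minute and df the number of them containing the word.
    The smoothing is an assumption made for this sketch.
    """
    series = {word: [] for word in vocabulary}
    for tweets in tweets_by_minute:
        n = len(tweets)
        token_sets = [set(tokens) for tokens in tweets]
        for word in vocabulary:
            df = sum(1 for tokens in token_sets if word in tokens)
            series[word].append(math.log((1 + n) / (1 + df)))
    return series


def drop_flat_words(series, min_std=0.1):
    """Keep only words whose IDF varies enough over time (min_std is hypothetical)."""
    return {w: v for w, v in series.items() if pstdev(v) >= min_std}


# Toy data: three minutes, two tokenised tweets per minute.
tweets_by_minute = [
    [["rain", "today"], ["rain", "again"]],
    [["rain", "goal"], ["goal", "scored"]],
    [["goal", "rain"], ["goal", "again"]],
]
series = idf_series(tweets_by_minute, ["rain", "goal"])
kept = drop_flat_words(series, min_std=0.3)
print(sorted(kept))
```

In this toy data the IDF of "goal" drops sharply after the first minute, so it is kept, while "rain" changes only a little and is removed.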
II.i For each word / IDF series, fit a sigmoid curve with tuple of parameters (a1,a2,a3,a4)
- If you are missing the package 'zoo' in R, please install that package before following the steps below.
- Run the command "./II_i_FitSig/isdSeriesToTuples.r filename" to fit the curves to the IDF values for each word contained in "filename.csv"
- The R script assumes your word-IDF series files are kept in the "./I_ii-iv_WordsIdfSeries/" directory. This step still requires further development and does not work perfectly in all cases, so the file "zayn_1d.csv" is included to show how the R script works.
- Output directory is "./II_i_IdfTup/"
- Overall, this script looks at each word and its series of IDF values, and tries to fit a sigmoid curve to it. The fitted sigmoid curve can be described with four parameters, which are precisely the four numbers listed beside each word in the output file for this script.
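The idea of reducing an IDF series to four numbers can be sketched as follows. This is not a port of the R script. The parameterisation f(t) = a1 + a2 / (1 + exp(-a3 * (t - a4))) is an assumption, and the fitting method (a coarse grid over a3 and a4, with a1 and a2 solved by ordinary least squares) is a simple stand-in chosen so that the example needs no extra packages. The meaning and order of a1 to a4 in the real output may differ. It is written for Python 3.

```python
import math


def sigmoid(t, a1, a2, a3, a4):
    """Four-parameter sigmoid; this parameterisation is an assumption."""
    return a1 + a2 / (1.0 + math.exp(-a3 * (t - a4)))


def fit_sigmoid(values, slopes, midpoints):
    """Coarse fit: grid over (a3, a4), closed-form least squares for (a1, a2)."""
    n = len(values)
    mean_y = sum(values) / n
    best = None
    for a3 in slopes:
        for a4 in midpoints:
            # Unit sigmoid for this slope and midpoint at t = 0, 1, ..., n - 1.
            s = [1.0 / (1.0 + math.exp(-a3 * (t - a4))) for t in range(n)]
            mean_s = sum(s) / n
            var_s = sum((x - mean_s) ** 2 for x in s)
            if var_s == 0.0:
                continue
            cov = sum((x - mean_s) * (y - mean_y) for x, y in zip(s, values))
            a2 = cov / var_s
            a1 = mean_y - a2 * mean_s
            err = sum((a1 + a2 * x - y) ** 2 for x, y in zip(s, values))
            if best is None or err < best[0]:
                best = (err, (a1, a2, a3, a4))
    return best[1]


# Synthetic IDF series: drops from about 5 to about 2 around minute 10.
values = [sigmoid(t, 5.0, -3.0, 1.0, 10) for t in range(20)]
a1, a2, a3, a4 = fit_sigmoid(values, slopes=[0.5, 1.0, 2.0], midpoints=range(20))
print(round(a1, 3), round(a2, 3), a3, a4)
```

A word that suddenly becomes frequent shows a falling IDF, which in this parameterisation corresponds to a negative a2 with the drop centred on a4.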
III.i Classify tuple as keyword / non-keyword
- To train the support vector classifier, run "./III_i_ClassifTup/supportVecIDF_createMod.py". This stores the trained model in "./III_i_ClassifTup/modelInfo" based on the training data given in "./III_i_ClassifTup/modelInfo/wordIdfTuples.csv". If you have already trained the classifier, you do not need to do this again unless you have new data to train with.
- "./III_i_ClassifTup/modelInfo/wordIdfTuples.csv" stores the training data for the classifier. The first two columns were used for other purposes and are not used here. "a1" to "a4" are the parameters of the sigmoid curve, in the same order as they are output by the script in II.i. "IsSigmoid" is a boolean ("TRUE" or "FALSE") that describes whether the corresponding curve is the type of sigmoid curve we are looking for.
- To use the classifier, run "./III_i_ClassifTup/supportVecIDF_test filename", where "filename" is the same filename you used in step "II.i".
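As an illustration of the training data layout described above, the sketch below reads rows in that shape into feature vectors and labels. It assumes the file has a header row with columns named "a1" to "a4" and "IsSigmoid". The names "id" and "word" for the two unused columns, the sample rows, and the function name "load_training_rows" are made up for the example. It is written for Python 3 and does not reproduce the repository's own scripts.

```python
import csv
import io


def load_training_rows(handle):
    """Return (features, labels) from a wordIdfTuples-style CSV.

    Only a1..a4 and IsSigmoid are read; any other columns are ignored.
    """
    features, labels = [], []
    for row in csv.DictReader(handle):
        features.append([float(row[name]) for name in ("a1", "a2", "a3", "a4")])
        labels.append(row["IsSigmoid"].strip().upper() == "TRUE")
    return features, labels


# Hypothetical sample; "id" and "word" stand in for the two unused columns.
sample = io.StringIO(
    "id,word,a1,a2,a3,a4,IsSigmoid\n"
    "1,zayn,5.0,-3.0,1.0,10.0,TRUE\n"
    "2,today,4.2,0.1,0.0,3.0,FALSE\n"
)
X, y = load_training_rows(sample)
print(X, y)
```

The resulting lists are in the shape that a support vector classifier implementation generally expects: one 4-element feature vector per word and one boolean label per vector.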