# Topic categorization example

In this tutorial, I will show how keywords can be used to assign topic categories to a set of tweets using [**nlpru**](https://github.com/sergegoussev/nlpru).

----------------

First, import the requirements. For data, I will use a sample collected from Twitter via the [GET statuses/sample API endpoint](https://developer.twitter.com/en/docs/tweets/sample-realtime/overview/GET_statuse_sample), normalized and stored in a MySQL [database schema](https://github.com/sergegoussev/Twitter_analysis/tree/master/SQL). The database connection and data extraction use the [**pysqlc**](https://github.com/sergegoussev/pysqlc) library.

In [1]:
import pandas as pd
from nlpru import FindTopics

from pysqlc import DB
db = DB('kremlin_tweets_db')

Successfully connected to kremlin_tweets_db database


Now let's extract some data. We will focus on a specific day captured by the *GET statuses/sample* method and saved in the `samp_twts_all_rus_twts_str` table:

In [2]:
def __get_data__(start_date, end_date):
    """
    Collect the data. Tweets and retweets are stored in separate tables, so
    query each with inner joins and append the two result sets with UNION ALL.
    """
    q = """
     SELECT 
        tmast.twttext as twttext,
        tsamp.twtid,
        tmast.userid, 
        twt_createdat, 
        imrev3
    FROM samp_twts_all_rus_twts_str tsamp 
        INNER JOIN twt_Master tmast 
        ON tsamp.twtid=tmast.twtid

        LEFT JOIN meta_all_users_communities com
        ON tmast.userid=com.userid

    WHERE tmast.twt_lang='ru'  
    AND tmast.twt_createdat >= '{start}'
    AND tmast.twt_createdat < '{end}'

    UNION ALL

    SELECT
        tmast.twttext AS twttext,
        tsamp.twtid,
        trts.userid,
        twt_createdat,
        imrev3
    FROM samp_twts_all_rus_twts_str tsamp
        INNER JOIN twt_rtmaster trts
        ON tsamp.twtid=trts.twtid
        INNER JOIN twt_master tmast
        ON trts.rttwtid=tmast.twtid
        
        LEFT JOIN meta_all_users_communities com
        ON tmast.userid=com.userid

    WHERE tmast.twt_lang='ru' 
    AND tmast.twt_createdat >= '{start}'
    AND tmast.twt_createdat < '{end}';
    """.format(start=start_date,
               end=end_date)
    raw = db.query(q)
    print("There are {:,} tweets in the captured sample!".format(len(raw)))
    return raw
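
The query above builds its date filter with string formatting, which is fine for trusted, hard-coded dates. As a hedged alternative, here is a minimal sketch of the same half-open date window (`>= start`, `< end`) and `UNION ALL` pattern using bound parameters. It uses Python's built-in `sqlite3` as a stand-in, because how **pysqlc** handles query parameters is not shown in this tutorial; the `tweets` and `retweets` tables and the `get_data` helper are simplified, hypothetical names, not the real schema.

```python
import sqlite3

# Stand-in schema: two tiny tables that mimic the tweet / retweet split.
# Table and column names are simplified and hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tweets   (twtid INTEGER, twttext TEXT, created_at TEXT);
    CREATE TABLE retweets (twtid INTEGER, rttwtid INTEGER);
    INSERT INTO tweets VALUES
        (1, 'original tweet', '2017-03-26 10:00:00'),
        (2, 'too early',      '2017-03-25 23:59:59'),
        (3, 'too late',       '2017-03-27 00:00:00');
    INSERT INTO retweets VALUES (10, 1), (11, 3);
""")


def get_data(conn, start_date, end_date):
    """Return tweets plus retweets whose original falls in [start_date, end_date)."""
    q = """
        SELECT t.twttext, t.twtid
        FROM tweets t
        WHERE t.created_at >= :start AND t.created_at < :end

        UNION ALL

        SELECT t.twttext, r.twtid
        FROM retweets r
            INNER JOIN tweets t ON r.rttwtid = t.twtid
        WHERE t.created_at >= :start AND t.created_at < :end
    """
    # Named placeholders let the driver handle quoting of the date values.
    return conn.execute(q, {"start": start_date, "end": end_date}).fetchall()


rows = get_data(conn, "2017-03-26", "2017-03-27")
print("There are {:,} tweets in the captured sample!".format(len(rows)))
```

Using an exclusive upper bound means the end date is simply the next day, with no need to spell out `23:59:59`.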

Specify *March 26th, 2017* as the day we want to focus on ([the day of massive protests in Russia](https://en.wikipedia.org/wiki/2017%E2%80%932018_Russian_protests#26_March_2017)).

In [3]:
start_date='2017-03-26'
end_date='2017-03-27'
raw = list(__get_data__(start_date=start_date, end_date=end_date))

There are 44,613 tweets in the captured sample!


Now let's pick a few keywords that characterize the protest action happening on this day:

In [4]:
# Let's say we pick the following keywords:
keywords1 = "россия, москва, митинг, навальный, задержать, против, акция, полицейский, димонответить, димон, протест, коррупция"
keywords = keywords1.split(", ")

First we instantiate the `FindTopics` class and give it the necessary [*parameters*](https://github.com/sergegoussev/nlpru/blob/master/docs/methods.md#topic-analysis), namely the tweets that we want to classify and the positions of the tweet text and tweet id in each row:

In [5]:
T = FindTopics(
    tweet_list=raw,
    tweet_text_index=0,
    tweet_id_index=1)

Now we call the `.Keyword_Match()` method and specify the topics we want as a dictionary that maps each topic name to its list of keywords:

In [6]:
r = T.Keyword_Match({'protests':keywords})

That's it! The output, `r`, is a dictionary keyed by tweet id, where each entry holds the following information:
```python
'twtid':{
    'clean_words': ['list of clean words'],
    'other': ['whatever other inputs you specified. In this case the user id, the created_at time stamp and the community'],
    'text': 'text of the actual tweet',
    'topic': 'topic of the tweet as identified'
}
```

Or in our case:

```python
'tweet id': {
    'clean_words': ['...list of clean words...'],
    'other': [
        'user id',
        datetime.datetime(time stamp tweet created at),
        int(community (imrev3))],
    'text': '...text of the actual tweet...',
    'topic': '...topic text label...'
}
```
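
To make the shape of this output concrete, here is a minimal pure-Python sketch of keyword-based topic labelling that produces the same kind of dictionary. It is not **nlpru**'s implementation: the `keyword_match` function, its parameters and the sample rows are hypothetical, and the cleaning step is a crude lowercase-and-tokenize. The keywords in this tutorial are given in dictionary form, which suggests the library normalizes Russian word forms before matching; this sketch does not, so it only matches exact tokens.

```python
import re


def keyword_match(tweets, topics, text_index=0, id_index=1):
    """Label each tweet with the first topic whose keywords it contains."""
    results = {}
    for row in tweets:
        text, twtid = row[text_index], row[id_index]
        # Crude cleaning: lowercase and split into runs of word characters.
        clean_words = re.findall(r"\w+", text.lower())
        topic = "none detected"
        for name, keywords in topics.items():
            if set(clean_words) & set(keywords):
                topic = name
                break
        results[twtid] = {
            "clean_words": clean_words,
            # Everything in the row other than the text and the id.
            "other": [v for i, v in enumerate(row)
                      if i not in (text_index, id_index)],
            "text": text,
            "topic": topic,
        }
    return results


# Hypothetical rows: (text, tweet id, community).
sample = [
    ("Митинг в центре города", 101, 7),
    ("Доброе утро всем", 102, 8),
]
r = keyword_match(sample, {"protests": ["митинг", "протест"]})
print(r[101]["topic"], "|", r[102]["topic"])
```

A tweet that matches no keyword list keeps the `none detected` label, which is the label that appears in the summary of the real results.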

In [11]:
df = pd.DataFrame.from_dict(r, orient='index')
df.reset_index(inplace=True)
df[["index","topic"]].groupby("topic").count()/df["index"].count()*100

| topic | index |
|---|---|
| none detected | 84.878508 |
| protests | 15.121492 |


Great! Using the keyword method, we have now detected that slightly over 15% of our tweets were on the topic we were listening for!
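
The percentage per topic can also be computed a little more directly. The sketch below uses pandas' `value_counts(normalize=True)` on the `topic` column; the small `r` dictionary is a toy stand-in for the real result, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for the dictionary returned by the keyword match.
r = {
    1: {"text": "a", "topic": "protests"},
    2: {"text": "b", "topic": "none detected"},
    3: {"text": "c", "topic": "none detected"},
    4: {"text": "d", "topic": "none detected"},
}

df = pd.DataFrame.from_dict(r, orient="index")
# normalize=True returns fractions that sum to 1; multiply for percentages.
share = df["topic"].value_counts(normalize=True) * 100
print(share)
```

This avoids the `reset_index` and `groupby` steps and returns a Series indexed by topic label.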