# Topic Categorization with conversation thread effects

Common ways to detect topics in a document include simple *keyword* matching, *topic modeling*, and many others. While these work fine for large text documents like news articles, applying them to social media data has a serious *methodological* flaw: posts are not isolated but are usually part of a conversation thread. If one post is detected as being on the topic, it is logical that a post replying to it is also on the topic. However, the reply might not use any of the tracked *keywords* and hence would not be tagged as on the topic.

This notebook walks through the problem and shows how **nlpru**'s `Recategorize_topics` method simplifies this analysis by taking conversation thread effects into account.

---------------

## The problem space 

On Twitter, a tweet can take part in a conversation thread in three potential scenarios:
![Conversation_thread_visual.PNG](Conversation_thread_visual.PNG)

1. **Topics flow down in threads, not up**: the first scenario is quite simple -- a tweet replies to another tweet. So for instance, if **tweet 1** is categorized as being on a certain topic, then logically every replying tweet is also on the topic (**tweets 2-5**). 
    * If **tweet 3** is on the topic, then **tweet 4** is as well, but not the others (**1, 2, or 5**)
    * *NOTE*: This is a simplification and depends on how a topic is defined. A reply can be on a separate topic, and when the topics being analyzed are logically close, the transition from one to the other is harder to determine. More on this later...
    
2. **Adding text/comment while retweeting also has to be taken into account**: when a user retweets a tweet, they have the option to *Retweet with comment* -- the comment has its own 'tweet' characteristics and is linked to (and displayed with) the original tweet, which is embedded below it. In the Twitter API, the embedded tweet is called a `quoted_status`. Hence if a *quoted tweet*, for instance **tweet 10**, is categorized on the topic, then the *commenting tweet*, **tweet 9**, must also be on the topic; and
   
3. **As regular retweets (without comment) have their own unique tweet id, this also has to be taken into account**: Twitter stores a reply relationship on the original (i.e. retweeted) tweet, not on the tweet that retweeted it. As we often investigate tweet topics by also including retweets as copies of the original, we need to take this into account.
    * In other words, say we want to create a picture of the topics that were discussed during a particular day. We would pull all the tweets and then pull all the retweets that were made during that day, and plot them by topic per hour. This means that we pull the `retweet tweet id`, `retweet created at`, and `retweeted tweet text`. In this case, we need to check if the retweeted tweet was a reply to another tweet that was determined to be on a specific topic!
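
The three scenarios boil down to one question: from which tweets could a given tweet inherit a topic? The sketch below is only an illustration of that idea. The dictionaries `replies`, `quotes`, and `retweets`, the helper names, and the tweet ids (which only loosely follow the figure) are all assumptions and not part of **nlpru**.

```python
# Hypothetical link tables, keyed by the "child" tweet id.
replies = {2: 1, 3: 2, 4: 3, 5: 1}   # tweet -> tweet it replies to
quotes = {9: 10}                     # commenting tweet -> quoted tweet
retweets = {20: 3}                   # retweet id -> original (retweeted) tweet


def parents_of(tweet_id):
    """Return the tweets from which `tweet_id` could directly inherit a topic."""
    parents = []
    for links in (replies, quotes, retweets):
        if tweet_id in links:
            parents.append(links[tweet_id])
    return parents


def ancestors_of(tweet_id):
    """Walk upwards through all link types and collect every upstream tweet."""
    seen, stack = set(), [tweet_id]
    while stack:
        for parent in parents_of(stack.pop()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

With these example links, retweet `20` sits downstream of tweets 3, 2, and 1 (scenario 3), while tweet `1` has no upstream tweets at all, which is why topics flow down but not up.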

## Conceptual solution

As such we have 4 inputs:
* the original tweets pre-categorized, for instance using the `nlpru.Keyword_Match` method;
* a list of tweets that reply to other tweets: `list_replies = [('twtid','inreplytotwtid'),...]`;
* a list of retweets, i.e. tweets that retweeted a previous tweet: `list_rts = [('twtid','rttwtid'),...]`; and
* a list of quotes, i.e. when one tweet quoted another: `list_quotes = [('twtid','qttwtid'),...]`

The output should have the same structure as the pre-categorized list of tweets, with the topic labels updated wherever the conversation thread linkages warrant it.

Conceptually, there are **two ways** to solve the problem:
1. Using the tweets that **are categorized** about the topic
    * In other words, for each tweet that is about a topic, use the conversation thread linkages and certain rules to tag all related or **downstream** tweets as also about the topic. This means that a conversation thread is *virtually* created for each tweet coded as about the topic.
2. The reverse, starting with the tweets **not categorized** as about the topic
    * This approach requires iterating through all uncategorized tweets and checking, for each one, whether the rules and conversation thread linkages allow it to be recategorized as about the topic. This step is repeated until tweets are **no longer** being recategorized.
    
While a thorough efficiency test would be needed to know which option is faster, option 1 requires building a conversation thread *object* first and then using it to categorize tweets. As this method does not currently exist, this will be done in a later section. *(Note: some, like @fionapigott, have created [conversation thread builders](https://github.com/fionapigott/conversation-builder), but they only work with replies and not quotes, which matters because quotes are often interrelated and start separate conversation threads.)*
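
For comparison, option 1 can be sketched as a breadth-first walk from every on-topic tweet to its downstream tweets. This is only an illustration under simplifying assumptions: the function name, the flat `{tweet_id: label}` dictionary, and the `(child, parent)` pairs are hypothetical and do not reflect an existing **nlpru** method.

```python
from collections import defaultdict, deque


def propagate_downstream(topics, links, topic):
    """Label every tweet downstream of an on-topic tweet (option 1).

    topics: {tweet_id: label}
    links:  iterable of (child_id, parent_id) pairs (replies, quotes, retweets)
    """
    # Invert the links so that each parent knows its direct children.
    children = defaultdict(list)
    for child, parent in links:
        children[parent].append(child)

    result = dict(topics)  # leave the input untouched
    queue = deque(t for t, label in topics.items() if label == topic)
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            # Note: this overwrites whatever label the child had before.
            if child in result and result[child] != topic:
                result[child] = topic
                queue.append(child)
    return result
```

Each tweet is relabeled at most once, so the walk visits every link only a bounded number of times, at the cost of building the inverted `children` index first.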

This notebook therefore follows the second option.

## Approach

The solution will be based on the following steps/rules:
1. Create a dictionary for each conversation relationship, such as `replies_dict = {'twtid':'inreplytotwtid',...}`
2. Create a dictionary for all tweets within the sample, and a sub-dictionary that contains the necessary parameters (topic, etc). For instance `tweet_dict = {'twtid':{'topic':'protests','twt_text':'bla bla bla','userid':'123456'}, ...}`
3. Iterate through all tweets **not** on the topic and, using the conversation relationships from #1, find the tweet that each tweet refers to (i.e. the *parent* of the *child*)
4. Use the dictionary of tweets and topics from #2 to check which topic the *parent* tweet is on, and if it is on the topic, change the topic of the *child*
5. Continue until no more tweets are recategorized with each loop
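
The steps above can be sketched as a small fixed-point loop. This is a simplified illustration written for this walkthrough, not the code in `conversation.py`; the function name `recategorize` and its arguments are assumptions.

```python
def recategorize(tweet_dict, link_dicts, topic):
    """Repeat passes over off-topic tweets until a pass changes nothing.

    tweet_dict: {tweet_id: {'topic': label, ...}}  (updated in place)
    link_dicts: list of {child_id: parent_id} dictionaries
                (replies, retweets, quotes)
    Returns the number of tweets recategorized in each pass.
    """
    rounds = []
    while True:
        changed = 0
        for tweet_id, info in tweet_dict.items():
            if info['topic'] == topic:
                continue  # already on the topic
            for links in link_dicts:
                # Look up the parent; it may be missing from the sample.
                parent = tweet_dict.get(links.get(tweet_id))
                if parent is not None and parent['topic'] == topic:
                    info['topic'] = topic
                    changed += 1
                    break
        rounds.append(changed)
        if changed == 0:
            return rounds
```

Because labels are updated in place, how far a topic travels within a single pass depends on the iteration order, but the final labels do not: the loop always ends with one pass that changes nothing.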

> The full code of the solution can be seen in the [conversation.py](../nlpru/conversation.py) file

Great! 

Now let's see how to use this method with **nlpru**. We first need to:
1. Categorize some tweets about a topic (we duplicate the steps described in the [Topic categorization example](nlpru_topic_categorization_walkthrough.ipynb))
2. Prepare the `replies`, `retweets`, and `quotes` lists

### 1. Get some data and pre-categorize tweets as about a topic

-----------------------------

In [1]:
import pandas as pd
from nlpru import FindTopics

from pysqlc import DB
db = DB('kremlin_tweets_db')

Successfully connected to kremlin_tweets_db database


In [2]:
def __get_data__(start_date, end_date):
    """
    Collect the data -- as tweets and retweets are stored separately, collect via inner joins and then
    use UNION to append. 
    """
    q = """
     SELECT 
        tmast.twttext as twttext,
        tsamp.twtid,
        tmast.userid, 
        twt_createdat, 
        imrev3
    FROM samp_twts_all_rus_twts_str tsamp 
        INNER JOIN twt_master tmast 
        ON tsamp.twtid=tmast.twtid

        LEFT JOIN meta_all_users_communities com
        ON tmast.userid=com.userid

    WHERE tmast.twt_lang='ru'  
    AND tmast.twt_createdat >= '{start}'
    AND tmast.twt_createdat < '{end}'

    UNION ALL

    SELECT
        tmast.twttext AS twttext,
        tsamp.twtid,
        trts.userid,
        twt_createdat,
        imrev3
    FROM samp_twts_all_rus_twts_str tsamp
        INNER JOIN twt_rtmaster trts
        ON tsamp.twtid=trts.twtid
        INNER JOIN twt_master tmast
        ON trts.rttwtid=tmast.twtid
        
        LEFT JOIN meta_all_users_communities com
        ON tmast.userid=com.userid

    WHERE tmast.twt_lang='ru' 
    AND tmast.twt_createdat >= '{start}'
    AND tmast.twt_createdat < '{end}';
    """.format(start=start_date,
               end=end_date)
    raw = db.query(q)
    print("There are {:,} tweets in the captured sample!".format(len(raw)))
    return raw

Specify *March 26th, 2017* as the day we want to focus on ([the day of massive protests in Russia](https://en.wikipedia.org/wiki/2017%E2%80%932018_Russian_protests#26_March_2017)).

In [5]:
start_date='2017-03-26'
end_date='2017-03-27'
raw = __get_data__(start_date=start_date, end_date=end_date)

There are 44,613 tweets in the captured sample!


Now add some keywords and classify this data as *about* a topic or *not* based on a match of **at least one** keyword:

In [6]:
# Let's say we pick the following keywords:
keywords1 = "россия, москва, митинг, навальный, задержать, против, акция, полицейский, димонответить, димон, протест, коррупция"
keywords = keywords1.split(", ")

In [7]:
T = FindTopics(
    tweet_list=raw,
    tweet_text_index=0,
    tweet_id_index=1)
r = T.Keyword_Match({'protests':keywords})

`r` is returned as a *dictionary* in which each key-value pair has the following structure:
```python
'tweet id': {
    'clean_words': ['...list of clean words...'],
    'other': [
        'user id',
        datetime.datetime(time stamp tweet created at),
        int(community (imrev3))],
    'text': '...text of the actual tweet...',
    'topic': '...topic text label ...'
}
```
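
To make the "at least one keyword" rule concrete, here is a minimal stand-alone sketch. It is not how `Keyword_Match` is implemented: the helper name is hypothetical, and unlike **nlpru** (whose `clean_words` field suggests some word normalization takes place) it compares raw lowercase tokens only.

```python
import re


def matches_any_keyword(text, keywords):
    """Return True if at least one keyword occurs as a whole token in `text`."""
    tokens = set(re.findall(r"\w+", text.lower()))
    return any(keyword.lower() in tokens for keyword in keywords)


keywords = "митинг, протест, коррупция".split(", ")
print(matches_any_keyword("Сегодня митинг в центре города", keywords))
print(matches_any_keyword("Хорошая погода сегодня", keywords))
```

Without normalization, an inflected form such as "митинге" does not match the keyword "митинг", which is one reason raw keyword matching under-counts on-topic tweets.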

To check what proportion of tweets is on the topic, let's convert it to a dataframe and calculate the percentages:

In [8]:
df = pd.DataFrame.from_dict(r, orient='index')
df.reset_index(inplace=True)
df[["index","topic"]].groupby("topic").count()/df["index"].count()*100

| topic | index |
|---|---|
| none detected | 84.878508 |
| protests | 15.121492 |


Hence, approximately 15% of the tweets are categorized as being about the topic we picked, based on keywords alone.

-----------------------

This is the benchmark -- from here, we can add conversation thread effects.

### 2. Prepare the conversation thread linkages

Get the *list of replies* for this sample and date range

In [9]:
repl_q = """
SELECT 
    repl.twtid, 
    inreplytotwtid
FROM meta_repliesmaster repl
    INNER JOIN samp_twts_all_rus_twts_str samp
    ON repl.twtid=samp.twtid
    
    INNER JOIN twt_master tm
    ON tm.twtid=repl.twtid
    
WHERE tm.twt_createdat >= '{start_date}'
AND tm.twt_createdat < '{end_date}'
""".format(start_date=start_date, end_date=end_date)
reply_list = db.query(repl_q)

Now get the *retweets* for this sample and date range

In [10]:
retweet_q = """
SELECT 
    rt.twtid,
    rttwtid
FROM twt_rtmaster rt
    INNER JOIN samp_twts_all_rus_twts_str samp
    ON rt.twtid=samp.twtid
    
WHERE rt.rt_createdat >= '{start_date}'
AND rt.rt_createdat < '{end_date}'
""".format(start_date=start_date, end_date=end_date)
retweet_list = db.query(retweet_q)

Now get the *quotes* for this sample and the date range

In [11]:
quote_q = """
SELECT 
    qt.twtid,
    qttwtid
FROM twt_qtmaster qt
    INNER JOIN samp_twts_all_rus_twts_str samp
    ON qt.twtid=samp.twtid
    
    INNER JOIN twt_master tm
    ON tm.twtid=qt.twtid
    
WHERE tm.twt_createdat >= '{start_date}'
AND tm.twt_createdat < '{end_date}'
""".format(start_date=start_date, end_date=end_date)
quote_list = db.query(quote_q)
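
Under the approach described earlier, a link can only pass on a topic if its parent tweet is itself in the categorized sample. A quick sanity check of how many links qualify might look like the sketch below. It assumes each query returns `(child, parent)` tuples as in the input lists above; the helper name `link_coverage` and the example ids are hypothetical.

```python
def link_coverage(link_list, tweet_ids):
    """Count links whose parent is inside vs. outside the categorized sample."""
    inside = sum(1 for _, parent in link_list if parent in tweet_ids)
    return inside, len(link_list) - inside


sample_ids = {"1", "2", "3"}
example_replies = [("2", "1"), ("3", "99")]
print(link_coverage(example_replies, sample_ids))  # (1, 1)
```

In the notebook, `tweet_ids` would be the keys of `r`, for example `link_coverage(reply_list, set(r))`, provided the ids in both structures have the same type.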

## Test the model

Now that all the data is assembled, let's run the model and see what we get:

In [13]:
from nlpru import Conversations

c = Conversations(
    reply_list=reply_list,
    retweet_list=retweet_list,
    quote_list=quote_list)

In [14]:
t = c.Recategorize_topics(topic_for_which_to_check="protests", tweet_dict=r)

1 iteration completed, recategorized this round: 388
2 iteration completed, recategorized this round: 17
3 iteration completed, recategorized this round: 0


We see that only 3 iterations were required, and the last one recategorized nothing: it merely confirmed that the labels had stopped changing.

In the first round, 388 tweets were recategorized based on conversation effects. Only 17 more were recategorized in the second round.

In [15]:
df_postconvos = pd.DataFrame.from_dict(t, orient='index')
df_postconvos.reset_index(inplace=True)
df_postconvos[["index","topic"]].groupby("topic").count()/df_postconvos["index"].count()*100

| topic | index |
|---|---|
| none detected | 83.970681 |
| protests | 16.029319 |


The method raised the share of tweets categorized as discussing the topic by roughly 0.9 percentage points, from 15.1% to 16.0%.
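
As a cross-check, the gain can be recomputed from the iteration log and the sample size reported earlier. This is a rough calculation that assumes the dataframe has about as many rows as the 44,613 tweets in the captured sample; it may have slightly fewer if some tweet ids repeat.

```python
total_tweets = 44_613          # sample size reported when the data was pulled
recategorized = 388 + 17       # tweets relabeled in rounds 1 and 2

gain = recategorized / total_tweets * 100
print(round(gain, 2))                    # 0.91 percentage points
print(round(16.029319 - 15.121492, 2))   # 0.91, the difference between the two tables
```

Both routes give the same figure, so the change in the topic share is fully explained by the 405 recategorized tweets.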