### Studying CSS patterns of Indian Speakers in Code-Mixed Tweets

In this notebook, we will learn one of the most useful ways of processing Code-Mixed text from Twitter (word-level language identification), and then analyze initial patterns in data consisting of tweets from/about Indian speakers.

### Table of Contents

1. Downloading and Preparing Data
2. Installing LID-tool
3. Running LID on the Tweets
4. Analyzing Code-Mixing Patterns in Tweets
5. Resources


**Note:** Before following the tutorial, please click on "Runtime" in the File Menu above and select "Factory reset runtime". Also select "Save a Copy in Drive" before running the notebook.

### Downloading and Preparing Data

#### 1. About Data

- We are using a custom dataset that was created for this class specifically. It has 510 tweets from/about Indian speakers.
- We have sampled some code-mixed English-Hindi tweets in roman script from [Silent Flame's NER Code-Mixed Corpus](https://github.com/SilentFlame/Named-Entity-Recognition).
- We have extracted tweets related to the query `india` from Twitter, using [Twitter's developer API](https://developer.twitter.com/en/docs/twitter-api/tweets/search/introduction).
- **Note:** None of this data is owned by us. It is not used or intended to be used for commercial purposes. It's being used for educational/scholarly purposes and available openly along with the code. If you decide to use it for commercial purposes, please reach out to the above parties for consent/license.

Let's download the data from GitHub!

In [None]:
!wget https://raw.githubusercontent.com/mohdsanadzakirizvi/plaksha_rasa/main/assignments/plaksha_tweet_data_1.csv

#### 2. Exploring the Data

It's time to do an initial exploration of the data.

In [None]:
# read data
import pandas as pd

tweetdf = pd.read_csv('plaksha_tweet_data_1.csv')
tweetdf

Let's have a closer look at some of the tweets:

In [None]:
for idx, twt in enumerate(tweetdf['Tweet'][36:41]):
  print(idx, ':', twt)

In [None]:
print(tweetdf['Tweet'][36])

We notice certain inconsistencies in the raw tweets, so let's do a basic round of preprocessing to fix them. We will normalize the text by first converting everything to lower case and then removing URLs and non-alphabetic characters (keeping `#` and `@` so that hashtags and user mentions survive).

In [None]:
import re

def text_cleaner(text):
    # convert to lowercase
    newString = text.lower()
    # remove links
    newString = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', newString)
    # replace everything except letters, '#' and '@' with a space
    newString = re.sub('[^a-zA-Z#@]', ' ', newString)
    return newString

As you can see below, the tweet is much cleaner than before. It's time to do the same for all the tweets in the dataset.

In [None]:
print(text_cleaner(tweetdf['Tweet'][36]))

In [None]:
tweetdf['cleaned_text'] = tweetdf['Tweet'].apply(text_cleaner)

tweetdf.head()

Now that we've cleaned the tweets, let's build a word cloud and see what kind of word frequency distribution we have in the underlying data.

In [None]:
#Import libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud

#Generate word cloud
total_text = " ".join(tweetdf.cleaned_text.tolist())
wc = WordCloud(width=400, height=330, max_words=100, background_color='white').generate_from_text(total_text)
plt.figure(figsize=(12, 8))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The above visualization gives a rudimentary overview of the topic/word distribution in the data. We can see some of the recurring topics. We also notice that a lot of the high-frequency words are code-mixed words from Hindi written in roman script.
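To check the word cloud against actual numbers, the same frequencies can be counted directly. The following is a minimal sketch, not part of the original notebook: `top_words` and the `sample` list are hypothetical, with `sample` standing in for `tweetdf['cleaned_text']`. Unlike `WordCloud`, it applies no stopword list, only a minimum token length.

```python
from collections import Counter

def top_words(texts, n=10, min_len=3):
    """Return the n most frequent tokens across an iterable of cleaned strings."""
    counts = Counter()
    for text in texts:
        # cleaned text is space-separated; split() also collapses repeated spaces
        counts.update(tok for tok in text.split() if len(tok) >= min_len)
    return counts.most_common(n)

# Hypothetical cleaned tweets, standing in for tweetdf['cleaned_text']
sample = [
    "yeh match bahut accha tha",
    "india won the match today",
    "match ke baad sab khush the",
]
print(top_words(sample, n=3))
```

On the real data, the call would be `top_words(tweetdf['cleaned_text'], n=20)`.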

### Installing LID-tool



#### About LID-tool 

- It is a word-level language identification tool for Code-Mixed text in which a language such as Hindi is written in roman script and mixed with English.

- At a broader level, we utilize an ML classifier that's trained using [MALLET](https://mimno.github.io/Mallet/index) to generate word-level probabilities for language tags. We then utilize these probabilities along with the context information of surrounding words to generate language tags for each word of the input.

- We also use hand-crafted dictionaries as look-up tables to cover unique, corner and conflicting cases to give a robust language identification tool. 

- Note: LID is shorthand for [Language Identification](https://en.wikipedia.org/wiki/Language_identification).

Now that we've learned about the tool, let's install it!


In [None]:
!git clone https://github.com/microsoft/LID-tool.git

In [None]:
!ls LID-tool/

In [None]:
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip mallet-2.0.8.zip

In [None]:
!pip install twitter-text-python

In [None]:
!mkdir LID-tool/mallet-2.0.8
!cp -r mallet-2.0.8 LID-tool/

In [None]:
%cd LID-tool/

In [None]:
!mv dictionaries/dict1hi.txt dictionaries/dicthi.txt
!touch dictionaries/dict1hi.txt

!mv dictionaries/dict1hinmov.txt dictionaries/dicthinmov.txt
!touch dictionaries/dict1hinmov.txt

If you've run all the above cells without an error then your setup of LID-tool is complete!

### Running LID on the Tweets

Now that we have installed LID-tool, let's run it on a bunch of sample sentences that are given in the tool. The following are the sentences we want to be language tagged:

In [None]:
!cat sampleinp.txt

#### 1. Initial run on sample tweets



In [None]:
!rm -f sampleinp.txt_tagged

In [None]:
!python getLanguage.py sampleinp.txt

Notice that in the third sentence, the words `main` and `kya` have high English probabilities even though they are actually Hindi words in that sentence's context. The same word `main` is actually an English word in the context of the fourth sentence, and accordingly has high English probabilities there.

Let's see what this means for our final language tag output:

In [None]:
!cat sampleinp.txt_tagged

As we can see, the LID-tool has performed well by tagging `main` in the fourth sentence as EN. However, it has mistakenly tagged the words `main` and `kya` as EN in the third sentence.

All is not bad though, since it has correctly predicted the word `karoon` as HI in the third sentence with high probability.

Let's see how we can improve the above output by utilizing the concept of dictionaries.

#### 2. Improving the output by utilizing dictionaries

The LID-tool has two major components: an ML model trained on n-gram features and a couple of word lists (dictionaries) in both the languages. 

![](https://raw.githubusercontent.com/microsoft/LID-tool/main/images/info_flow_new_lid.PNG)



Now let's make a slight change to our dictionaries. Open the `LID-tool/dictionaries/dict1hinmov.txt` file and add the words "kya" and "main", each on its own line. Then let's rerun the pipeline and check what changes.
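If you would rather make this edit from a notebook cell than by hand, something like the sketch below should work. The helper `add_words` is hypothetical (it is not part of LID-tool), and it assumes the dictionary file is plain text with one word per line, which is what the manual step above implies.

```python
from pathlib import Path

def add_words(dict_path, words):
    """Append each word on its own line, skipping words already in the file."""
    path = Path(dict_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    existing = set(text.split())
    # dict.fromkeys drops duplicates in `words` while keeping their order
    new_words = [w for w in dict.fromkeys(words) if w not in existing]
    if new_words:
        with path.open("a", encoding="utf-8") as fp:
            # start on a fresh line if the file lacks a trailing newline
            if text and not text.endswith("\n"):
                fp.write("\n")
            fp.write("\n".join(new_words) + "\n")
    return new_words

# Only touch the file when run from inside the LID-tool/ directory
if Path("dictionaries").is_dir():
    print(add_words("dictionaries/dict1hinmov.txt", ["kya", "main"]))
```

The function returns the words it actually added, so running the cell twice leaves the file unchanged the second time.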

In [None]:
!rm -f sampleinp.txt_tagged
!rm -f dictionaries/memoize_dict.pkl

In [None]:
!python getLanguage.py sampleinp.txt

In [None]:
!cat sampleinp.txt_tagged

The addition of dictionaries in the project was an engineering decision that was taken after considering the empirical results, which showed that the dictionaries complemented the performance of the ML-based classifier (MALLET) for certain corner-cases. 

Here are some of the problems that this method solved:

**1. Dealing with “common words” that can belong to either of the languages.**

For example, the English word `“to”` is also one of the ways in which the Hindi word `“तो”` or `“तू”` is spelt when written in `roman script`, so the word “to” will be classified differently in the following two sentences:


**Input:**       I have to get back to my advisor

**Output:**    I/EN have/EN ***to/EN*** get/EN back/EN to/EN my/EN advisor/EN 

**Input:**       Bhai to kabhi nahi sudhrega

**Output:**    Bhai/HI ***to/HI*** kabhi/HI nahi/HI sudhrega/HI


In this case, we make sure that the word “to” is present in both the dictionaries and the LID is supposed to focus more on the combination of ML probabilities and Context (surrounding words) to tag the language. For instance, the probability values "1e-09" and "0.999999999" indicate the word is present in dictionary(s).

**2. Words that surely belong to only one language.**

For example, words like “bhai”, “nahi”, “kabhi” in Hindi and words like “advisor”, “get” etc. in English. In this case, we utilize the relevant dictionary to force tag it to the correct language even if the ML classifier says otherwise.

So the questions that you have to ask yourself while creating the dictionaries are: 

1. “Are there certain words that can be spelt the same way in both the languages?” And, 
2. “Are there common words in one language that surely can’t be used in the other language?”

These are just a couple of things that we looked at while building this tool, but given your specific use-case you can consider more such engineering use-cases and customize the dictionaries accordingly. 
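The two dictionary cases described above can be illustrated with a toy rule. This is a simplified sketch and not the tool's actual implementation: `tag_word`, the two small word sets, and the `ml_prob_en` score are all hypothetical, and the context of surrounding words (which the real tool also uses) is left out entirely.

```python
def tag_word(word, en_dict, hi_dict, ml_prob_en):
    """Toy rule: dictionaries decide unambiguous words, the classifier decides the rest.

    ml_prob_en is a hypothetical classifier probability that the word is English.
    """
    w = word.lower()
    in_en, in_hi = w in en_dict, w in hi_dict
    if in_en and not in_hi:
        return "EN"  # case 2: word found only in the English dictionary
    if in_hi and not in_en:
        return "HI"  # case 2: word found only in the Hindi dictionary
    # case 1 (word in both dictionaries) or unknown word: use the classifier score
    return "EN" if ml_prob_en >= 0.5 else "HI"

# Hypothetical miniature dictionaries built from the examples above
en_dict = {"to", "get", "advisor"}
hi_dict = {"to", "bhai", "nahi", "kabhi"}

print(tag_word("advisor", en_dict, hi_dict, ml_prob_en=0.2))  # dictionary overrides the score
print(tag_word("bhai", en_dict, hi_dict, ml_prob_en=0.9))     # dictionary overrides the score
print(tag_word("to", en_dict, hi_dict, ml_prob_en=0.8))       # ambiguous: score decides
print(tag_word("to", en_dict, hi_dict, ml_prob_en=0.1))       # ambiguous: score decides
```

The point of the sketch is the ordering of the checks: a word present in both dictionaries is deliberately not force-tagged, which is why “to” should be listed in both.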

- [Implementation in Code](https://github.com/microsoft/LID-tool/blob/f6528ebd8ac77b561f10d3659799f3f8894deb09/getLanguage.py#L147)
- [More details in Docs](https://github.com/microsoft/LID-tool/blob/main/Train_Custom_LID.md)

### Analyzing Code-Mixing Patterns in Tweets

#### Getting Language Tagged Tweets

Due to the limitations of the free version of Colab and in the interest of time, we have already computed language tags for all the tweets in the dataset. Let's fetch them!

In [None]:
!wget https://raw.githubusercontent.com/mohdsanadzakirizvi/plaksha_rasa/main/assignments/tagged_tweets.txt

In [None]:
with open('tagged_tweets.txt') as fp:
  tagged_tweets = fp.readlines()

tagged_tweets[:5]

In [None]:
len(tagged_tweets)
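Before aggregating, it helps to see how a single tagged line breaks down. The sketch below assumes each line holds whitespace-separated `word/TAG` items, as in the example outputs shown earlier; `parse_tagged` is a hypothetical helper, not something provided by LID-tool.

```python
def parse_tagged(line):
    """Split one tagged line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # rpartition splits on the LAST '/', so a '/' inside the token is kept
        token, sep, tag = item.rpartition("/")
        if sep:  # skip items that carry no tag at all
            pairs.append((token, tag))
    return pairs

example = "Bhai/HI to/HI kabhi/HI nahi/HI sudhrega/HI"
print(parse_tagged(example))
```

Working with explicit `(token, tag)` pairs avoids substring checks on the raw item, which can misfire when a word itself contains the letters of a tag.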

#### 1. Basic aggregation metrics

Let's find out the number of tweets in each of the categories: Mostly En, Mostly Hi and Code-Mixed. We will also cluster the tweets later based on it.

In [None]:
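# threshold: a tweet is "mostly" one language if at least 70% of its tokens carry that tag
thres = 0.7

outdf = []

# count HI- and EN-tagged tokens in each tweet
for twt in tagged_tweets:
  en_fr, hi_fr = 0, 0
  tags = twt.split()
  taglen = len(tags)
  # skip comment lines (starting with '##') and blank lines
  if twt.startswith('##') or taglen == 0:
    continue
  for t in tags:
    # compare only the tag after the last '/', so that words which
    # happen to contain "HI" or "EN" are not miscounted
    tag = t.rsplit('/', 1)[-1]
    if tag == 'HI':
      hi_fr += 1
    elif tag == 'EN':
      en_fr += 1

  # tuple layout: (tweet, EN, HI, CM)
  if hi_fr/taglen >= thres:
    outdf.append((twt, 0, 1, 0))
  elif en_fr/taglen >= thres:
    outdf.append((twt, 1, 0, 0))
  else:
    outdf.append((twt, 0, 0, 1))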

# create the output dataframe
outdf = pd.DataFrame(outdf, columns = ['Tweet', 'EN', 'HI', 'CM'])
outdf.head()

So now we have tagged each tweet based on whether it has mostly EN, HI or Code-Mixed words. 

**P.S.** You can play with the threshold value `thres` given above and see how things change. Go crazy!
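To compare several thresholds at once instead of editing the value by hand, the classification can be wrapped in a function and swept. This is a sketch under the same assumptions as the cell above (whitespace-separated `word/TAG` items, `##` marking lines to skip); `categorize`, `sweep`, and the `sample` lines are hypothetical.

```python
from collections import Counter

def categorize(line, thres=0.7):
    """Label one tagged tweet as 'EN', 'HI' or 'CM' by the share of tagged tokens."""
    tags = [item.rpartition("/")[2] for item in line.split()]
    if not tags:
        return None
    if tags.count("HI") / len(tags) >= thres:
        return "HI"
    if tags.count("EN") / len(tags) >= thres:
        return "EN"
    return "CM"

def sweep(lines, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Count tweets per category for each threshold value."""
    usable = [l for l in lines if l.strip() and not l.startswith("##")]
    return {t: Counter(categorize(l, t) for l in usable) for t in thresholds}

# Hypothetical tagged lines, standing in for tagged_tweets
sample = [
    "I/EN have/EN to/EN get/EN back/EN",
    "Bhai/HI to/HI kabhi/HI nahi/HI sudhrega/HI",
    "yeh/HI match/EN bahut/HI accha/HI tha/HI",
]
for t, counts in sweep(sample).items():
    print(t, dict(counts))
```

On the real data, `sweep(tagged_tweets)` shows how many tweets move into the CM bucket as the threshold gets stricter.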

Let's now aggregate how many tweets fall in each category!

In [None]:
outdf.sum()

Do you want to see tweets falling in a particular category? Just replace the 'CM' with the required tag in code in the following line:

In [None]:
outdf[outdf['CM'] == 1]

#### 2. Extracting user mentions and hashtags

We now have a general idea of the distribution of EN-, HI- and CM-heavy tweets.
Let's try to find some explanation for why one of these languages is prominent in a given tweet.

To do that, we can take advantage of information already present in our tweets: hashtags (starting with `#`) and user mentions (starting with `@`).

In [None]:
tagdf = []

# extract user mentions and tags
for twt in tagged_tweets:
  hashtags, users = [], []
  if not twt.startswith('##'):
    tags = twt.split()
    taglen = len(tags)
    for t in tags:
      if 'OTHER' in t:
        tok = t.split('/')[0]
        if t.startswith('@'):
          users.append(tok)
        if t.startswith('#'):
          hashtags.append(tok)
    tagdf.append((twt, hashtags, users))

# create the output dataframe
tagdf = pd.DataFrame(tagdf, columns = ['Tweet', 'Hashtags', 'Users'])
tagdf.head()

#### 3. Hypothesis

Now that we have tagged the tweets by both the prominent language and the hashtags + user mentions, let's merge the two dataframes and look at the bigger picture!

In [None]:
finaldf = outdf.merge(tagdf)

finaldf.sample(20)

- On running the above script multiple times, we notice that there are patterns dictating when a tweet has Code-mixing vs. when it's just all English or all Hindi.

- One clear pattern is that tweets from formal settings, such as those by news agencies or organisations (@reuters, @geologytime, @ani etc.), tend to use pure English even when they are talking about incidents in India. This holds even for Indian news agencies.

- Conversely, we see a lot of Code-Mixed words, or a majority of Hindi words, in tweets by individual accounts. These are everyday folks expressing their honest opinion on certain topics (with hashtags of major incidents in Indian society). Since Code-Mixing occurs naturally in their speech and day-to-day informal interactions, there's a tendency to do the same online. This is interesting because they aren't just going for all Hindi words or even an all Hindi script, but are heavily mixing both the script and the words with Roman English.

- Another interesting pattern is that some organisations tend to use Code-Mixing when they are advertising a product or service (e.g. @jio). This is probably a marketing hack that makes the customer feel "closer" to the product by speaking in "their colloquial language", which is usually full of Code-Mixing.

- The above set of early hypotheses is consistent with previous studies of Code-Mixing, which describe it as a spoken-language phenomenon that tends to occur more in informal contexts like speech, social media and chat apps like WhatsApp, Messenger etc.

- Please note that the above are only qualitative observations; we can perform more rigorous quantitative analyses to reach solid conclusions from the data.

- Do you see any other patterns? Thoughts?
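One way to start turning the observations above into numbers is to compare the category mix of tweets that mention a known organisation handle with that of all other tweets. The sketch below is only a starting point: `category_share_by_group` and the `rows` data are hypothetical, and a mention is used as a rough proxy for who a tweet is from or about, which is an assumption rather than something the dataset guarantees.

```python
from collections import Counter

def category_share_by_group(rows, org_handles):
    """rows: iterable of (category, users), with category in {'EN', 'HI', 'CM'}.

    Returns {group: {category: share}} for tweets that mention a listed
    organisation handle ('org') versus all other tweets ('other').
    """
    counts = {"org": Counter(), "other": Counter()}
    handles = {h.lower() for h in org_handles}
    for category, users in rows:
        group = "org" if any(u.lower() in handles for u in users) else "other"
        counts[group][category] += 1
    shares = {}
    for group, c in counts.items():
        total = sum(c.values())
        shares[group] = {cat: n / total for cat, n in c.items()} if total else {}
    return shares

# Hypothetical rows; the organisation handles are the ones named above
rows = [
    ("EN", ["@reuters"]),
    ("EN", ["@ani"]),
    ("CM", ["@jio"]),
    ("CM", []),
    ("HI", ["@some_user"]),
    ("CM", ["@another_user"]),
]
print(category_share_by_group(rows, ["@reuters", "@ani", "@geologytime"]))
```

With the notebook's dataframes, the rows could be built by mapping each row of `finaldf` to its single active label among `EN`, `HI` and `CM`, paired with its `Users` list. A proper analysis would also need a larger, labelled list of organisation accounts and a significance test.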

### Resources

You can refer to the following resources for more details about the project:

- [GitHub Page](https://github.com/microsoft/LID-tool)
- [Papers](https://github.com/microsoft/LID-tool#papers)