# Capstone project report
## Description
My capstone project, Sentimeter, is an unsupervised tool written in Python that, given a phenomenon of interest, reveals the attitudes of different social groups toward it. I first sample users who have tweeted about the phenomenon, using the Twitter API through the **twitter** module. I tokenize and clean their tweets using the TweetTokenizer class, a dictionary of English stop words, and other natural language processing tools from the **nltk** module. I then compile a corpus of texts, one per user, each a bag-of-words representation of all of that user's tweets, and run topic modeling on the corpus using Latent Dirichlet Allocation (module **gensim**). This extracts the most persistent topics mentioned by the sampled users and shows how strongly each user is associated with each topic. More specifically, it constructs a basis of topics and represents each user as a point in the resulting multidimensional space.

Next, I run k-means clustering (module **sklearn**) on these points to identify social groups. Each group is characterized by the average topic representation of its users, i.e. by the topic decomposition of the centroid of the corresponding cluster.

Finally, I compute each group's attitude toward the phenomenon as the average sentiment of the tweets that mention it and are authored by users in that group. I consider two dimensions of sentiment, polarity (negative vs. positive) and subjectivity (subjective vs. objective), and compute both using tools from the **textblob** module. One novel component is the normalization of sentiment scores for each user across all their tweets, which measures not the raw sentiment but how much it deviates from the user's average. This accounts for individual writing style and patterns of emotional expression, which can differ significantly across Twitter users.

## Business objective
The goal of Sentimeter is to provide insight into who thinks what about anything that can be represented as a key phrase, which is in high demand in sales, politics, economics, and beyond. An obvious example is assessing how favorably some product, service, or event is received across the population. By learning the most typical kinds of consumers along with their emotions about it, one can infer whom to target and which strategy to choose for the most value in the future. In this way, Sentimeter could enable highly detailed, targeted predictions that may inform marketing strategies and political decisions.

## Implementation
I chose the search term 'nike ad' to represent the phenomenon of interest and limited my sample to 100 users, which resulted in 13K tweets. The total size of the data processed was 60 MB.

Fetching the data from Twitter:

In [None]:
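import twitter  # python-twitter package

search_term = 'nike ad'

# The four credentials are assumed to be defined earlier (not shown in this report).
api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token_key,
                  access_token_secret=access_token_secret,
                  sleep_on_rate_limit=True,
                  tweet_mode='extended')
# Fetch recent English-language tweets that match the search term.
results = api.GetSearch(term=search_term, lang='en', result_type="recent")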
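The report does not show how the sample of users is drawn from the search results. The following is only a minimal sketch of one way to do it, assuming each result exposes its author as `status.user.id` (as python-twitter status objects do). The helper name `sample_user_ids` and the stand-in objects are hypothetical and are not part of Sentimeter.

```python
from types import SimpleNamespace


def sample_user_ids(statuses, limit=100):
    """Return up to `limit` distinct author IDs, in order of first appearance."""
    seen = set()
    user_ids = []
    for status in statuses:
        user_id = status.user.id
        if user_id in seen:
            continue
        seen.add(user_id)
        user_ids.append(user_id)
        if len(user_ids) == limit:
            break
    return user_ids


# Stand-in objects that mimic the `status.user.id` shape of search results.
fake_results = [SimpleNamespace(user=SimpleNamespace(id=i)) for i in (7, 3, 7, 9, 3)]
sampled = sample_user_ids(fake_results, limit=2)
print(sampled)
```

Deduplicating by author keeps a prolific user from being counted more than once in the sample.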

Tokenizing the tweets and cleaning them of stop words, emoticons, etc.:

In [None]:
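import re

import nltk
from stop_words import get_stop_words  # assumed to come from the stop-words package

tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=True, reduce_len=True, strip_handles=True)
# Merge the NLTK and stop-words lists with a few Twitter-specific terms.
stop_words = set(nltk.corpus.stopwords.words('english')).union(
    get_stop_words('en') + ['rt', 'http', 'via', 'inc'])

clean_texts = []
for tweet_text in tweets[text_col]:
    clean_tokens = []
    for token in tokenizer.tokenize(tweet_text):
        # Compare in lower case: the tokenizer preserves case, the stop lists are lowercase.
        if token.lower() not in stop_words and not nltk.tokenize.casual.EMOTICON_RE.search(token):
            clean_tokens.append(re.sub(r"[^a-zA-Z0-9_']|^'|'$", "", token))
    # Lowercase the tokens and drop those of one character or less.
    clean_texts.append(" ".join(token.lower() for token in clean_tokens if len(token) > 1))
tweets[clean_text_col] = clean_texts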
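To make the effect of the cleaning pattern concrete, here is a small sketch that applies the same regular expression to a few hand-picked tokens. The helper name `clean_token` and the sample tokens are illustrative assumptions, not taken from the project data.

```python
import re

# Same pattern as in the cleaning loop above.
CLEAN_RE = re.compile(r"[^a-zA-Z0-9_']|^'|'$")


def clean_token(token):
    """Strip disallowed characters and edge apostrophes, then lowercase."""
    return CLEAN_RE.sub("", token).lower()


samples = ["#JustDoIt", "don't", "'quoted'", "ad!", "https://t.co/abc"]
cleaned = [clean_token(token) for token in samples]
print(cleaned)
```

Hashtag signs and punctuation are removed while inner apostrophes survive. Note that if a URL reaches this step as a single token, it is not dropped but collapsed into one alphanumeric string, so filtering URLs beforehand may be worth considering.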

Vectorizing the per-user texts, each comprising all of a user's tweets:

In [None]:
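import gensim

# Map every distinct token in the per-user texts to an integer ID.
dictionary = gensim.corpora.Dictionary(text.split() for text in all_users[text_col])
# Drop tokens that occur in fewer than 5 user texts or in more than half of them.
dictionary.filter_extremes(no_below=5, no_above=0.5)
# Represent each user text as a list of (token_id, count) pairs.
corpus = [dictionary.doc2bow(text.split()) for text in all_users[text_col]]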
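The bag-of-words step can be illustrated without gensim. The sketch below is a simplified stand-in, not gensim's implementation: the function names `build_vocabulary` and `to_bow` are hypothetical, IDs are assigned alphabetically (gensim numbers tokens differently), and only the `no_below` / `no_above` document-frequency filters are mimicked.

```python
from collections import Counter


def build_vocabulary(docs, no_below=1, no_above=1.0):
    """Map each kept token to an integer ID, filtering by document frequency."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(token for token, df in doc_freq.items()
                  if df >= no_below and df / len(docs) <= no_above)
    return {token: idx for idx, token in enumerate(kept)}


def to_bow(doc, vocabulary):
    """Return sorted (token_id, count) pairs for tokens present in the vocabulary."""
    counts = Counter(vocabulary[token] for token in doc if token in vocabulary)
    return sorted(counts.items())


docs = [text.split() for text in ["nike ad nike", "ad boycott", "ad shoes nike"]]
vocabulary = build_vocabulary(docs, no_below=2, no_above=0.9)
bows = [to_bow(doc, vocabulary) for doc in docs]
print(vocabulary, bows)
```

In this toy corpus "ad" appears in every document and is filtered out as too common, while "boycott" and "shoes" are too rare, leaving only "nike". The same trade-off applies to the thresholds used above: they remove words that cannot help tell users apart.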

Running topic modeling on the text corpus using Latent Dirichlet Allocation, with the number of topics preset to 10:

In [None]:
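from os import path

import gensim
import pandas as pd

# LDA model parameters.
pass_count = 10
topic_count = 10

lda_model = gensim.models.LdaMulticore(corpus, id2word=dictionary, num_topics=topic_count,
                                       workers=3, passes=pass_count)
topic_filename = path.join(data_dir, "topics.csv")
topics = pd.DataFrame({id_col: range(topic_count)})
# word_count (number of top words kept per topic) is not defined in this excerpt.
# Request that many terms explicitly, since get_topic_terms returns only 10 by default.
topic_terms = [lda_model.get_topic_terms(j, topn=word_count) for j in range(topic_count)]
for i in range(word_count):
    topics["word{}".format(i)] = [dictionary[terms[i][0]] for terms in topic_terms]
    topics["weight{}".format(i)] = [terms[i][1] for terms in topic_terms]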

Clustering users by their coordinates in the topic basis, with the number of clusters preset to 3:

In [None]:
import numpy as np
import sklearn.cluster

# K-means parameter.
cluster_count = 3

# Convert each user's sparse (topic_id, weight) pairs into a dense topic vector.
lda_corpus = []
for bow in corpus:
    lda_bow = [0.0] * topic_count
    for i, v in lda_model[bow]:
        lda_bow[i] = v
    lda_corpus.append(lda_bow)

k_means = sklearn.cluster.KMeans(n_clusters=cluster_count).fit(np.array(lda_corpus))
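The zero-filling in the loop above matters because `lda_model[bow]` returns only the topics whose weight exceeds the model's minimum probability threshold, so the list can be shorter than `topic_count`. The sketch below shows that conversion on made-up numbers and, as an illustration of what cluster membership means, assigns the resulting point to its nearest centroid. The helper names and all values are hypothetical, and the nearest-centroid rule is a plain-Python stand-in for the assignment step that k-means performs.

```python
import math


def to_dense(sparse_topics, topic_count):
    """Expand (topic_id, weight) pairs into a fixed-length list, zero-filled."""
    dense = [0.0] * topic_count
    for topic_id, weight in sparse_topics:
        dense[topic_id] = weight
    return dense


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Hypothetical values: a 3-topic model, one user, two cluster centroids.
user_vector = to_dense([(2, 0.7), (0, 0.3)], topic_count=3)
toy_centroids = [[0.9, 0.1, 0.0], [0.1, 0.1, 0.8]]
group = nearest_centroid(user_vector, toy_centroids)
print(user_vector, group)
```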

Describing the resulting social groups as the topic decomposition of their centroids:

In [None]:
centroids = pd.DataFrame({centroid_col: range(cluster_count)})
for i in range(topic_count):
    centroids["topic{}".format(i)] = [c[i] for c in k_means.cluster_centers_]

# Topic-word probability matrix, shape (topic_count, vocabulary size); fetched once.
topic_word = lda_model.get_topics()
descriptions = []
for c in k_means.cluster_centers_:
    # Word distribution of the centroid: topic distributions weighted by its coordinates.
    description = np.dot(c, topic_word)
    # Keep the five highest-weighted words.
    word_ids = np.argsort(description)[::-1][:5]
    word_weights = description[word_ids]
    descriptions.append("+".join("{}*{:.3f}".format(dictionary[wrd], wgt)
                                 for wrd, wgt in zip(word_ids, word_weights)))
centroids[description_col] = descriptions
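The centroid description is a weighted mixture: each topic's word distribution is scaled by the centroid's coordinate for that topic, and the results are summed word by word. A dependency-free sketch on toy numbers follows. The function name `describe_centroid`, the three-word vocabulary, and the probabilities are all made up for illustration.

```python
def describe_centroid(centroid, topic_word_probs, id2word, top_n=3):
    """Mix topic-word distributions by centroid weights and return the top words."""
    mixed = [sum(weight * topic[w] for weight, topic in zip(centroid, topic_word_probs))
             for w in range(len(id2word))]
    ranked = sorted(range(len(id2word)), key=lambda w: mixed[w], reverse=True)
    return [(id2word[w], mixed[w]) for w in ranked[:top_n]]


id2word = ["ad", "boycott", "shoes"]
topic_word_probs = [[0.75, 0.25, 0.0],   # toy topic 0
                    [0.0, 0.5, 0.5]]     # toy topic 1
top_words = describe_centroid([0.25, 0.75], topic_word_probs, id2word, top_n=2)
print(top_words)
```

Because this centroid leans toward topic 1, the words that topic favors dominate its description even though "ad" is the strongest word of topic 0.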

Analysing the normalized sentiment scores (for both polarity and subjectivity) of relevant tweets across the social groups:

In [None]:
from textblob import TextBlob

# Polarity: raw score per tweet, then z-score within each user's tweets.
matched_tweets[tb_pol_col] = matched_tweets[clean_text_col].map(
    lambda text: TextBlob(text).sentiment.polarity)
matched_tweets[tb_pol_mean_col] = matched_tweets[tb_pol_col].groupby(matched_tweets[user_col]).transform("mean")
matched_tweets[tb_pol_std_col] = matched_tweets[tb_pol_col].groupby(matched_tweets[user_col]).transform("std")
matched_tweets[tb_pol_score_col] = (
    (matched_tweets[tb_pol_col] - matched_tweets[tb_pol_mean_col]) / matched_tweets[tb_pol_std_col])

# Subjectivity: same procedure.
matched_tweets[tb_sub_col] = matched_tweets[clean_text_col].map(
    lambda text: TextBlob(text).sentiment.subjectivity)
matched_tweets[tb_sub_mean_col] = matched_tweets[tb_sub_col].groupby(matched_tweets[user_col]).transform("mean")
matched_tweets[tb_sub_std_col] = matched_tweets[tb_sub_col].groupby(matched_tweets[user_col]).transform("std")
matched_tweets[tb_sub_score_col] = (
    (matched_tweets[tb_sub_col] - matched_tweets[tb_sub_mean_col]) / matched_tweets[tb_sub_std_col])
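The per-user normalization is a z-score: each tweet's score minus the user's mean, divided by the user's standard deviation. The sketch below reproduces it with the standard library on made-up scores. It uses the sample standard deviation, which is also the pandas default, and returns `None` where the z-score is undefined (a user with a single tweet, or with identical scores). Treating those cases as `None` is an assumption of this sketch; in the pandas version they come out as NaN.

```python
from collections import defaultdict
from statistics import mean, stdev


def normalize_per_user(records):
    """Return (user, z_score) pairs; z_score is None when it is undefined."""
    by_user = defaultdict(list)
    for user, score in records:
        by_user[user].append(score)

    normalized = []
    for user, score in records:
        scores = by_user[user]
        # The sample standard deviation needs at least two tweets and some spread.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        z_score = (score - mean(scores)) / spread if spread > 0 else None
        normalized.append((user, z_score))
    return normalized


records = [("a", 0.0), ("a", 1.0), ("a", 2.0), ("b", 0.5)]
normalized = normalize_per_user(records)
print(normalized)
```

User "a" gets scores of -1, 0 and 1 regardless of how positive their writing is overall, which is the point of the normalization. User "b" has only one tweet and therefore no defined score.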

## Plots
Normalized sentiment scores, averaged across each social group (using module **plotly**):

![sentiment]('nike ad' sentiment.png)
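The report does not include the aggregation code behind this plot. As a rough sketch under stated assumptions, group-level values of this kind could be obtained by averaging the normalized tweet scores over all tweets written by each group's members. The function name `average_by_group`, the user-to-group mapping, and the scores below are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by_group(scored_tweets, user_to_group):
    """Average the defined tweet scores within each group."""
    by_group = defaultdict(list)
    for user, score in scored_tweets:
        # Skip undefined scores and users without a cluster assignment.
        if score is not None and user in user_to_group:
            by_group[user_to_group[user]].append(score)
    return {group: mean(scores) for group, scores in sorted(by_group.items())}


scored_tweets = [("a", -1.0), ("a", 0.5), ("b", 1.0), ("c", None), ("c", 0.5)]
user_to_group = {"a": 0, "b": 1, "c": 1}
group_means = average_by_group(scored_tweets, user_to_group)
print(group_means)
```

Averaging over tweets rather than over users gives more weight to prolific users. Averaging per user first would be an alternative design.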

Social group 1, left-most on the X-axis, shows a strongly negative attitude in its tweets mentioning "nike ad". Judging by the most significant words of the cluster, its members tend to lean conservative. An additional insight is that people in this group tend toward higher subjectivity when mentioning "nike ad".

Social group 2, in the middle of the X-axis, stands out for its positive attitude in tweets mentioning "nike ad". It can be roughly characterized as "politically active liberals", which becomes more evident from its expanded decomposition by words (using module **wordcloud**):

![topic_decomposition](group 2 decomposition.png)

## Conclusion
The "Implementation" section above showcases one particular use of Sentimeter, aimed at learning attitudes toward the much-discussed Nike ad that came out recently. Any other phenomenon can be studied in the same way, and all it requires is a relevant search term. With more time and computing power, the sample size and the number of social groups studied can be increased, yielding more detailed insights.