### Entities' cause commitment in public messaging
This notebook replicates the analysis in the paper: Zhao Wang, Jennifer Cutler, Aron Culotta. ["Are Words Commensurate with Actions? Quantifying Commitment to A Cause from Online Public Messaging"](www.google.com) IEEE International Conference on Data Mining Workshops (ACUMEN: Data science for human performance in social networks), 2017.<br><br>
The goal of this analysis is to identify potential "inauthentic" entities by comparing how do entities show commitment toward a cause in their tweets and in their actions.<br><br>
This notebook has been tested in **Python 3.4.3** only.<br><br>
The code below is exactly what was used to produce the tables and figures in the final version of the paper. See requirements.txt for the versions of external libraries used.<br><br>
**Data**<br>
Please see the paper for the details of data collection. This notebook assumes access to the pre-collected data, which is available here:
[link to dropbox]<br>
Please contact Zhao (zwang185@hawk.iit.edu) for access.<br>
This is about [1GB]. Once you download this data, place it in a folder called data, in the same folder as this notebook.

### 7 sections of implementation and analysis:<br>
**Section 1: select cause-relevant tweets as training data<br>**
> 1.1 Get entities' information<br>
> 1.2 Read tweets for entities<br>
> 1.3 Score and sort tweets by cause-relevance<br>
> 1.4 Select and label each entity's top-n relevant tweets as training data<br>

**Section 2: feature engineering**
- 2.1 Linguistic features<br>
>2.1.1 Sentiment polarity<br>
>2.1.2 Pronouns <br>
>2.1.3 Cause keywords and Context of cause keywords <br>
>2.1.4 Social interactions <br>
>2.1.5 Part-Of-Speach tag <br>
- 2.2 Word embedding features<br>
>2.2.1 Tweet vector and tweet cause relevance score<br>
>2.2.2 Top-n(n=3,5) words, top-n words' vector and top-n words' cause relevance scores<br>
>2.2.3 Cause keywords(sim>=0.30), number of cause keywords, cause keywords' vector, cause words' relevance scores<br>
>2.2.4 Context words (window = 1), context vector, context words' contribution scores<br>

**Section 3: train and evaluate support classifier with manually labeled tweets**
>3.1 Evaluating linguistic features, word embedding features and combination of various features<br>
>3.2 Evaluate different classifiers<br>
>3.3 Analyze terms that have high coefficients<br>

**Section 4: train and evaluate commitment classifier with manually labeled tweets**
>4.1 Evaluating linguistic features, word embedding features and combination of various features<br>
>4.2 Evaluating different classifiers<br>
>4.3 Analyze terms that have high coefficients<br>

**Secrtion 5: apply pre-trained classifiers to predict for unseen tweets**
>5.1 Apply support classifier to classify all brands' tweets into support and non-support classes<br>
>5.2 Apply commitment classifier to classify all brands' support tweets into high- and low- commitment classes<br>

**Section 6: aggregate each entity's cause-commitment tweets and compare with action score to find inauthentic entities**
>6.1 Apply different aggregation methods to select entities that have high word-ratings<br>
>6.2 Sort high word-rating entities by their action-rating and select top-n(high word-rating but low action-rating) as inauthentic entities

**7.Fit linear regression model to analyze how does entities' word commitment level relate with action-ratings**<br><br>

### Implementation
There are 3 datasets with entity&cause pairs: brands with health cause, brands with eco cause, member of congress (moc for short) with eco cause. The code will implement 7 sections for each dataset separately.

In [1]:
import Cause
import nltk, logging, re, operator
from nltk.corpus import stopwords
import numpy as np
from collections import Counter, defaultdict

from sklearn.model_selection import cross_val_score,KFold
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
from sklearn import svm
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier


from __future__ import print_function
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from pprint import pprint
from time import time

%matplotlib inline

logging.basicConfig(level=logging.INFO,format='%(asctime)s %(levelname)s %(message)s')

Using Theano backend.


In [2]:
BRAND_PATH = '/data/2/lip_service/recollect2/'
ECO_SCORE = '/data/2/lip_service/recollect2/brands.csv'
HEALTH_SCORE = '/data/2/zwang/2017_S/Tweet_Health/goodguide_health.csv'
CONGRESS_PATH = '/data/2/zwang/congress/congress_pruned.json'
W2V_PATH = '/data/2/zwang/brand_score/GoogleNews/GoogleNews-vectors-negative300.bin'
eco_term_path = '/data/2/lip_service/eco_terms.txt'
brand_eco_path = '/data/2/zwang/2017_S/Tweet_Eco/brand_eco_labeled_tweets.csv'
brand_health_path = '/data/2/zwang/2017_S/Tweet_Health/brand_health_labeled_tweets.csv'
moc_eco_path = '/data/2/zwang/2017_S/Congress_Eco/congress_eco_labeled_tweets.csv'

In [3]:
eco_cause = "environment"
eco_keywords =["environment","ecosystem","biodiversity","habitats","climate","ecology","plantlife","pollution","rainforests"]

health_cause = "health"
health_keywords =["healthy","nutritious","lowfat","wholesome","organic","natural","vegan"]

In [4]:
eco_terms = Cause.read_eco_terms(eco_term_path)

In [5]:
mycv=KFold(n_splits=10, shuffle=True, random_state=42)

In [6]:
GN_model = Cause.load_GoogleNews_w2v(W2V_PATH)

2017-09-21 10:51:43,847 INFO loading projection weights from /data/2/zwang/brand_score/GoogleNews/GoogleNews-vectors-negative300.bin


Start loading GoogleNews word2vec model.


2017-09-21 10:52:46,552 INFO loaded (3000000, 300) matrix from /data/2/zwang/brand_score/GoogleNews/GoogleNews-vectors-negative300.bin


The vocabulary size is: 3000000


#### Dataset 1: brands with health cause

**[brands-health]** section 1: select cause-relevant tweets as training data

In [68]:
healthbrand_score_dict, healthbrand_sector = Cause.get_brand_info(HEALTH_SCORE,"screen_name","health score")

Get 169 brands with health score


In [7]:
healthbrand_nameid = Cause.get_healthbrand_nameid(ECO_SCORE)

In [8]:
print("Sector distribution of health brands:")
health_sector = Counter()
health_sector.update(healthbrand_sector.values())
health_sector

Sector distribution of health brands:


Counter({'Food': 106, 'Personal Care': 63})

In [9]:
healthbrand_tweets_dict = Cause.read_brand_tweets(BRAND_PATH+"tweets.pruned.json.gz", list(healthbrand_score_dict.keys()),
                                            cause="health")

read 500000 lines
read 1000000 lines
read 1500000 lines
read 2000000 lines
read 2500000 lines
Collected 429009 tweets for 142 health brands in total.


In [9]:
healthbrand_twID_dict, healthbrand_twID_twtext = Cause.dedup_tweets(healthbrand_tweets_dict,cause="health")

processed 100 entities
352160 non-duplicate tweets for 142 health brands


In [10]:
healthbrand_twID_twScore = Cause.score_tweet_by_relevance(healthbrand_twID_twtext,GN_model,health_keywords)

Note: This function takes some time to run. Please run for once, and save results to file.
processed 100000 tweets
processed 200000 tweets
processed 300000 tweets


In [11]:
healthbrand_twIDScore = Cause.sort_tweet_by_score(healthbrand_twID_twScore,healthbrand_twID_dict)

In [12]:
#Cause.select_topn_tweets(filename,healthbrand_twIDScore,healthbrand_twID_twtext,topn=3)

**[brands-health]** section 2: feature enginerring

> Labeled data for support classification

In [10]:
sup_healthbrand_list, sup_healthtweet_list, sup_healthlabel_list = Cause.data_for_sup_clf(brand_health_path,
                                                                                          entity='health-brand')

Read 494 positive instances and 177 negative instances for support classification


In [11]:
sup_health_neg_terms, sup_health_pos_terms = Cause.get_freq_terms(sup_healthtweet_list,sup_healthlabel_list)

In [12]:
print("Most common terms in negative class (non-support):")
sup_health_neg_terms.most_common(20)

Most common terms in negative class (non-support):


[('_URL_', 74),
 ('delicious', 42),
 ('rt', 27),
 ('eat', 18),
 ('chocolate', 16),
 ('milk', 14),
 ('fresh', 14),
 ('food', 13),
 ('amp', 12),
 ('yummy', 11),
 ('_NUMBER_', 11),
 ('make', 9),
 ('eating', 9),
 ('products', 9),
 ('good', 8),
 ('flavors', 8),
 ('new', 8),
 ('fruit', 8),
 ('us', 7),
 ('diet', 7)]

In [13]:
print("Most common terms in positive class (support):")
sup_health_pos_terms.most_common(20)

Most common terms in positive class (support):


[('_URL_', 315),
 ('healthy', 182),
 ('rt', 101),
 ('delicious', 74),
 ('_NUMBER_', 67),
 ('organic', 63),
 ('foods', 59),
 ('natural', 53),
 ('nutritious', 52),
 ('amp', 51),
 ('eat', 44),
 ('_HASHTAG_vegan', 43),
 ('snack', 40),
 ('_HASHTAG_organic', 39),
 ('skin', 39),
 ('ingredients', 37),
 ('free', 35),
 ('food', 33),
 ('_HASHTAG_healthy', 33),
 ('vegan', 32)]

> Labeled data for commitment classification

In [14]:
comt_healthbrand_list, comt_healthtweet_list, comt_healthlabel_list = Cause.data_for_commit_clf(brand_health_path,
                                                                                          entity='health-brand')

Read 238 positive instances and 256 negative instances for commitment classification


In [15]:
comt_health_neg_terms, comt_health_pos_terms = Cause.get_freq_terms(comt_healthtweet_list,comt_healthlabel_list)

In [91]:
print("Most common terms in negative class (low-commitment):")
comt_health_neg_terms.most_common(20)

Most common terms in negative class (low-commitment):


[('_URL_', 161),
 ('healthy', 117),
 ('foods', 47),
 ('rt', 46),
 ('eat', 36),
 ('_NUMBER_', 32),
 ('delicious', 31),
 ('nutritious', 28),
 ('diet', 24),
 ('amp', 23),
 ('skin', 23),
 ('eating', 22),
 ('organic', 22),
 ('snack', 21),
 ('food', 20),
 ('_HASHTAG_vegan', 20),
 ('_HASHTAG_healthy', 17),
 ('healthier', 17),
 ('great', 17),
 ('veggies', 16)]

In [92]:
print("Most common terms in negative class (high-commitment):")
comt_health_neg_terms.most_common(20)

Most common terms in negative class (high-commitment):


[('_URL_', 161),
 ('healthy', 117),
 ('foods', 47),
 ('rt', 46),
 ('eat', 36),
 ('_NUMBER_', 32),
 ('delicious', 31),
 ('nutritious', 28),
 ('diet', 24),
 ('amp', 23),
 ('skin', 23),
 ('eating', 22),
 ('organic', 22),
 ('snack', 21),
 ('food', 20),
 ('_HASHTAG_vegan', 20),
 ('_HASHTAG_healthy', 17),
 ('healthier', 17),
 ('great', 17),
 ('veggies', 16)]

**[brands-health]** section 2.1: linguistic cues:<br>
2.1.1 Sentiment polarity <br>
2.1.2 Pronouns <br>
2.1.3 Cause keywords and Context of cause keywords <br>
2.1.4 Social interactions <br>
2.1.5 Part-Of-Speach tag <br>

In [104]:
print("Bag-of-words feature, serve as baseline.")
sup_BOW_tweet = sup_healthtweet_list
sup_bow_vectorizer,sup_BOW_tw_matrix = Cause.construct_feature_matrix(sup_BOW_tweet)
print(sup_BOW_tw_matrix.shape)

comt_BOW_tweet = comt_healthtweet_list
comt_bow_vectorizer, comt_BOW_tw_matrix = Cause.construct_feature_matrix(comt_BOW_tweet)
print(comt_BOW_tw_matrix.shape)

Bag-of-words feature, serve as baseline.
(671, 2610)
(494, 1983)


In [105]:
print("Sentiment polarity feature, for example: [\"It's not organic\"]")
print(Cause.mark_polarity(["It's not organic"],to_wd=0))
sup_BOW_addpola = Cause.mark_polarity(sup_healthtweet_list,to_wd=0)
sup_pola_vectorizer,sup_BOW_pola_matrix = Cause.construct_feature_matrix(sup_BOW_addpola)
print(sup_BOW_pola_matrix.shape)
comt_BOW_addpola = Cause.mark_polarity(comt_healthtweet_list,to_wd=0)
comt_pola_vectorizer,comt_BOW_pola_matrix = Cause.construct_feature_matrix(comt_BOW_addpola)
print(comt_BOW_pola_matrix.shape)

Sentiment polarity feature, for example: ["It's not organic"]
["It's _NEG_ organic"]
(671, 2611)
(494, 1984)


In [106]:
print("Pronoun feature, for example:")    
print(Cause.mark_pronouns(['Did you know? I did\'t but they do'], binary= False))
sup_BOW_addPron_tweet = Cause.mark_pronouns(sup_healthtweet_list, binary = False)
sup_pron_vectorizer,sup_pron_matrix = Cause.construct_feature_matrix(sup_BOW_addPron_tweet)
print(sup_pron_matrix.shape)
comt_BOW_addPron_tweet = Cause.mark_pronouns(comt_healthtweet_list, binary = False)
comt_pron_vectorizer, comt_pron_matrix = Cause.construct_feature_matrix(comt_BOW_addPron_tweet)
print(comt_pron_matrix.shape)

Pronoun feature, for example:
["Did you know? I did't but they do first__person second__person third__person"]
(671, 2613)
(494, 1986)


In [107]:
sup_BOW_addCont_tweet = Cause.mark_context(sup_healthtweet_list,eco_terms)
sup_cont_vectorizer,sup_BOW_cont_matrix = Cause.construct_feature_matrix(sup_BOW_addCont_tweet)

comt_BOW_addCont_tweet = Cause.mark_context(comt_healthtweet_list,eco_terms)
comt_cont_vectorizer, comt_BOW_cont_matrix = Cause.construct_feature_matrix(comt_BOW_addCont_tweet)

print("Eco keywords features, for example:")    
print(sup_healthtweet_list[80])
print(sup_BOW_addCont_tweet[80])
print(sup_BOW_cont_matrix.shape)
print(comt_BOW_cont_matrix.shape)

Eco keywords features, for example:
RT @organictvshow: RT if you love Organic   #organic #healthy #wholefoods #nutrition @WholeFoods @Stonyfield @Horizon_Organic @Honest @orga…
rt _MENTION_organictvshow rt if you love organic _HASHTAG_organic _HASHTAG_healthy _HASHTAG_wholefoods _HASHTAG_nutrition _MENTION_wholefoods _MENTION_stonyfield _MENTION_horizon_organic _MENTION_honest _MENTION_orga
(671, 2623)
(494, 1997)


In [108]:
print("Eco keywords' context features, for example:")
print(Cause.remove_keywords(["By walking or taking your bike you won't produce greenhouse http://foo.com gas emissions"],eco_terms))
sup_BOW_rmTopic_tweet = Cause.remove_keywords(sup_healthtweet_list, eco_terms)
sup_rmtopic_vectorizer,sup_BOW_rmtopic_matrix = Cause.construct_feature_matrix(sup_BOW_rmTopic_tweet)

comt_BOW_rmTopic_tweet = Cause.remove_keywords(comt_healthtweet_list, eco_terms)
comt_rmtopic_vectorizer, comt_BOW_rmtopic_matrix = Cause.construct_feature_matrix(comt_BOW_rmTopic_tweet)

print(sup_BOW_rmtopic_matrix.shape)
print(comt_BOW_rmtopic_matrix.shape)

Eco keywords' context features, for example:
["by walking or taking your bike you won't produce   _URL_   left_context_produce right_context__url_ left_context__url_"]
(671, 2616)
(494, 1991)


In [109]:
sup_BOW_self_tweet_once = Cause.selfmention(sup_healthbrand_list,healthbrand_nameid,sup_healthtweet_list, count="once")
sup_BOW_self_tweet_all = Cause.selfmention(sup_healthbrand_list,healthbrand_nameid,sup_healthtweet_list, count="all")

comt_BOW_self_tweet_once = Cause.selfmention(comt_healthbrand_list,healthbrand_nameid,comt_healthtweet_list, count="once")
comt_BOW_self_tweet_all = Cause.selfmention(comt_healthbrand_list,healthbrand_nameid,comt_healthtweet_list, count="all")

print("Selfmention features, for example:")
print(sup_healthtweet_list[4])
print(sup_BOW_self_tweet_once[4])
print(sup_BOW_self_tweet_all[4])

Selfmention features, for example:
NEW! IZZE Sparkling Water Beverage. Certified USDA organic and delicious. #organic #SparkleBrightly 💧✨ http://t.co/rkeeMm7lOu
NEW! IZZE Sparkling Water Beverage. Certified USDA organic and delicious. #organic #SparkleBrightly 💧✨ http://t.co/rkeeMm7lOu _SELF_
_SELF_NEW! _SELF_IZZE _SELF_Sparkling _SELF_Water _SELF_Beverage. _SELF_Certified _SELF_USDA _SELF_organic _SELF_and _SELF_delicious. _SELF_#organic _SELF_#SparkleBrightly _SELF_💧✨ _SELF_http://t.co/rkeeMm7lOu 


In [110]:
sup_BOW_pos_tags,sup_BOW_pos_wdtags = Cause.mark_pos(sup_healthtweet_list)
sup_pos_vectorizer,sup_BOW_pos_matrix = Cause.construct_feature_matrix(sup_BOW_pos_wdtags)

comt_BOW_pos_tags, comt_BOW_pos_wdtags = Cause.mark_pos(comt_healthtweet_list)
comt_pos_vectorizer, comt_BOW_pos_matrix = Cause.construct_feature_matrix(comt_BOW_pos_wdtags)

print("Part-of-speech tagging features, for example:")
print(sup_healthtweet_list[0])
print(sup_BOW_pos_tags[0])
print(sup_BOW_pos_wdtags[0])
print(sup_BOW_pos_matrix.shape)
print(comt_BOW_pos_matrix.shape)

Part-of-speech tagging features, for example:
@emilyhalford Because our products contain dairy, they are not vegan!
NN IN PRP$ NNS VBP NN PRP VBP RB JJ 
_MENTION_emilyhalford NN because IN our PRP$ products NNS contain VBP dairy NN they PRP are VBP not RB vegan JJ 
(671, 2631)
(494, 2007)


**[brands-health]** section 2.1: GrisearchCV to find best parameters for countvectorizer

In [111]:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    #('tfidf', TfidfTransformer()),
    ('lr', LogisticRegression()),
])

parameters = {
    'vect__min_df': (1, 3, 5,10,0.3),
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__tokenizer': (None,Cause.tw_tokenize_with_features),
    #'vect__max_features': (None, 1000, 2000, 3000),
    #'vect__ngram_range': ((1, 1), (1, 2),(1,3)),  # unigrams or bigrams
    #'vect__binary': (True, False),
    
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),

    'lr__penalty': ('l2', 'l1'),
    'lr__class_weight': ("balanced",None)
}

In [26]:
Cause.do_grid_search(pipeline,parameters,data = sup_healthtweet_list, label = sup_healthlabel_list, score='f1')

Performing grid search...
pipeline: ['vect', 'lr']
parameters:
{'lr__class_weight': ('balanced', None),
 'lr__penalty': ('l2', 'l1'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__min_df': (1, 3, 5, 10, 0.3),
 'vect__tokenizer': (None,
                     <function tw_tokenize_with_features at 0x7fa4fc1537b8>)}
Fitting 3 folds for each of 120 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Done  88 tasks      | elapsed:    1.4s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:    5.0s finished


done in 6.144s

Best score: 0.901
Best parameters set:
	lr__class_weight: None
	lr__penalty: 'l2'
	vect__max_df: 0.5
	vect__min_df: 1
	vect__tokenizer: None


In [27]:
Cause.do_grid_search(pipeline,parameters,data = comt_healthtweet_list, label = comt_healthlabel_list, score='f1')

Performing grid search...
pipeline: ['vect', 'lr']
parameters:
{'lr__class_weight': ('balanced', None),
 'lr__penalty': ('l2', 'l1'),
 'vect__max_df': (0.5, 0.75, 1.0),
 'vect__min_df': (1, 3, 5, 10, 0.3),
 'vect__tokenizer': (None,
                     <function tw_tokenize_with_features at 0x7fa4fc1537b8>)}
Fitting 3 folds for each of 120 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Done 124 tasks      | elapsed:    1.6s


done in 5.099s

Best score: 0.700
Best parameters set:
	lr__class_weight: 'balanced'
	lr__penalty: 'l2'
	vect__max_df: 0.75
	vect__min_df: 1
	vect__tokenizer: None


[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:    4.0s finished


**[brands-health]** section 2.2: word2vec features: <br>
2.2.1 Tweet vector and tweet cause relevance score <br>
2.2.2 Top-n(n=3,5) words, top-n words' vector and top-n words' cause relevance scores <br>
2.2.3 Cause keywords(sim>=0.30), number of cause keywords, cause keywords' vector, cause words' relevance scores <br>
2.2.4 Context words (window = 1), context vector, context words' contribution scores <br>

In [112]:
print("Tweet vector:")
sup_W2V_tw_vector = Cause.construct_tw_vector(sup_healthtweet_list,GN_model)
print(sup_W2V_tw_vector.shape)
comt_W2V_tw_vector = Cause.construct_tw_vector(comt_healthtweet_list,GN_model)
print(comt_W2V_tw_vector.shape)

Tweet vector:
(671, 300)
(494, 300)


In [29]:
print("Tweet cause-relevance score:")
sup_W2V_tw_score = Cause.calculate_tw_w2v_score(sup_healthtweet_list,GN_model,health_keywords)
print((sup_W2V_tw_score.shape))
comt_W2V_tw_score = Cause.calculate_tw_w2v_score(comt_healthtweet_list,GN_model,health_keywords)
print((comt_W2V_tw_score.shape))
print(sup_healthtweet_list[10])
print(sup_W2V_tw_score[10])

Tweet cause-relevance score:
(671, 1)
(494, 1)
@ASButtland chips (sugar, chocolate liquor, cocoa butter, soy lecithin, natural flavor) I'd recommend checking the package as ingredients...
[ 0.634]


In [30]:
print("Rand words by cause-relevance:\nFor example:")
sup_tweet_rankedwd_list = Cause.rank_match_words(sup_healthtweet_list, GN_model, health_keywords)
comt_tweet_rankedwd_list = Cause.rank_match_words(comt_healthtweet_list, GN_model, health_keywords)
print(sup_healthtweet_list[0])
print(sup_tweet_rankedwd_list[0])

Rand words by cause-relevance:
For example:
@emilyhalford Because our products contain dairy, they are not vegan!
[('vegan', '0.694'), ('dairy', '0.467'), ('products', '0.321'), ('contain', '0.169')]


In [31]:
print("Top-n words in each tweet:\nFor example:")
sup_W2V_topnwd, sup_W2V_topnwd_scores = Cause.get_topn_words(sup_tweet_rankedwd_list,n=3)
sup_topn_vectorizer,sup_W2V_topnwd_matrix = Cause.construct_feature_matrix(sup_W2V_topnwd)
print(sup_healthtweet_list[0])
print(sup_tweet_rankedwd_list[0])
print(sup_W2V_topnwd[0])
print(sup_W2V_topnwd_scores[0])

print(sup_W2V_topnwd_matrix.shape)

comt_W2V_topnwd, comt_W2V_topnwd_scores = Cause.get_topn_words(comt_tweet_rankedwd_list,n=3)
comt_topn_vectorizer, comt_W2V_topnwd_matrix = Cause.construct_feature_matrix(comt_W2V_topnwd)
print(comt_W2V_topnwd_matrix.shape)

Top-n words in each tweet:
For example:
@emilyhalford Because our products contain dairy, they are not vegan!
[('vegan', '0.694'), ('dairy', '0.467'), ('products', '0.321'), ('contain', '0.169')]
vegan dairy products 
[ 0.694  0.467  0.321]
(671, 419)
(494, 274)


In [32]:
print("Each tweet is represented by mean of top-n words' vectors:")
sup_W2V_topnwd_vectors = Cause.get_topn_vectors(sup_tweet_rankedwd_list,GN_model,n=3)
print(sup_W2V_topnwd_vectors.shape)
comt_W2V_topnwd_vectors = Cause.get_topn_vectors(comt_tweet_rankedwd_list,GN_model,n=3)
print(comt_W2V_topnwd_vectors.shape)

Each tweet is represented by mean of top-n words' vectors:
(671, 300)
(494, 300)


In [33]:
print("Words with cause-relevance score >= 0.3 serve as cause keywords:\nOrganized as:[relevance-score, leftword_contribution, rightword_contribution].\nFor example:")
sup_W2V_topicwd_list,sup_tweet_topicwd_tp_list = Cause.get_topic_words(sup_healthtweet_list,GN_model,health_keywords,threshold = 0.30)
sup_topic_vectorizer,sup_Topicwd_matrix = Cause.construct_feature_matrix(sup_W2V_topicwd_list)
print(sup_healthtweet_list[10])
print(sup_W2V_topicwd_list[10])
print(sup_tweet_topicwd_tp_list[10])
print(sup_Topicwd_matrix.shape)

comt_W2V_topicwd_list, comt_tweet_topicwd_tp_list = Cause.get_topic_words(comt_healthtweet_list,GN_model,health_keywords,threshold = 0.30)
comt_topic_vectorizer, comt_Topicwd_matrix = Cause.construct_feature_matrix(comt_W2V_topicwd_list)
print(comt_Topicwd_matrix.shape)

Words with cause-relevance score >= 0.3 serve as cause keywords:
Organized as:[relevance-score, leftword_contribution, rightword_contribution].
For example:
@ASButtland chips (sugar, chocolate liquor, cocoa butter, soy lecithin, natural flavor) I'd recommend checking the package as ingredients...
sugar chocolate cocoa butter soy lecithin natural flavor ingredients 
{'chocolate': [0.444, 0.534, 0.278], 'lecithin': [0.459, 0.502, 0.228], 'cocoa': [0.319, 0.186, 0.249], 'natural': [0.478, 0.228, 0.144], 'sugar': [0.396, 0.131, 0.534], 'ingredients': [0.481, 0.057, 0.0], 'soy': [0.501, 0.284, 0.502], 'flavor': [0.381, 0.144, -0.039], 'butter': [0.3, 0.249, 0.284]}
(671, 455)
(494, 378)


In [34]:
print("Each tweet's keywords' vector represented by mean of cause-relevant words' vectors:")
sup_W2V_topicwd_vectors = Cause.get_topicwd_vec(GN_model,sup_tweet_topicwd_tp_list)
print(sup_W2V_topicwd_vectors.shape)

comt_W2V_topicwd_vectors = Cause.get_topicwd_vec(GN_model,comt_tweet_topicwd_tp_list)
print(comt_W2V_topicwd_vectors.shape)

Each tweet's keywords' vector represented by mean of cause-relevant words' vectors:
(671, 300)
(494, 300)


In [35]:
print("Get number of cause keywords, keywords' cause-relevance scores, keywords' left word contribution scores, keywords' right word contribution scores.")
sup_W2V_topicwd_ct,sup_W2V_topicwd_score,sup_W2V_topicwd_leftcontri,sup_W2V_topicwd_rightcontri = Cause.sep_topic_features(sup_tweet_topicwd_tp_list)
comt_W2V_topicwd_ct,comt_W2V_topicwd_score,comt_W2V_topicwd_leftcontri,comt_W2V_topicwd_rightcontri = Cause.sep_topic_features(comt_tweet_topicwd_tp_list)

Get number of cause keywords, keywords' cause-relevance scores, keywords' left word contribution scores, keywords' right word contribution scores.


In [36]:
print("Number of cause keywords in each tweet, for example:")
print(sup_healthtweet_list[150])
print(sup_tweet_topicwd_tp_list[150])
print(sup_W2V_topicwd_ct[150])

Number of cause keywords in each tweet, for example:
So good you’ll want to pop them in your fruit basket -- fresh new scents from Ulta Beauty Collection! https://t.co/5mZ9zP8wCp
{'beauty': [0.302, 0.147, 0.147], 'fruit': [0.376, 0.098, 0.149], 'fresh': [0.366, 0.134, 0.445], 'good': [0.308, 0.301, 0.344]}
[4]


In [37]:
print("Cause keywords' cause-relevance scores, for example:")
print(sup_healthtweet_list[1])
print(sup_tweet_topicwd_tp_list[1])
print(sup_W2V_topicwd_score[1])

Cause keywords' cause-relevance scores, for example:
Always end up eating too much when I have yummy Chinese food!! #nowcantmove Mx
{'yummy': [0.552, 0.036, 0.248], 'food': [0.508, 0.22, 0.0], 'eating': [0.54, 0.045, 0.233]}
[0.552, 0.508, 0.54]


In [38]:
print("Sum of keywords' cause-relevance scores, for example:")
sup_W2V_topicwd_sum = Cause.get_topicwd_score_sum(sup_W2V_topicwd_score)
comt_W2V_topicwd_sum = Cause.get_topicwd_score_sum(comt_W2V_topicwd_score)
print(sup_W2V_topicwd_score[1])
print(sup_W2V_topicwd_sum[1])

Sum of keywords' cause-relevance scores, for example:
[0.552, 0.508, 0.54]
[ 1.6]


In [39]:
print("keywors' left context words' contribution scores:")
print(sup_W2V_topicwd_leftcontri.shape)# not fixed elementes

keywors' left context words' contribution scores:
(671,)


In [40]:
print("Sum of keywords' context words' contribution scores:")
sup_W2V_contri_score = Cause.get_contri_sum(sup_tweet_topicwd_tp_list)
comt_W2V_contri_score = Cause.get_contri_sum(comt_tweet_topicwd_tp_list)
print(sup_W2V_contri_score.shape)
print(sup_W2V_contri_score[150])

Sum of keywords' context words' contribution scores:
(671, 3)
[ 0.623  0.945  0.953]


**[brands-health]** section 3: train and evaluate support classifier

In [41]:
#Use logistic regression as a basic classifier.
sup_lr = LogisticRegression(solver = 'lbfgs',multi_class ='ovr')

**[brands-health]** section 3.1: Evaluating linguistic features, word embedding features and combination of various features

In [113]:
#bag-of-words
sup_bow_vectorizer,sup_BOW_tw_matrix = Cause.construct_feature_matrix(sup_BOW_tweet)

#bag-of-words + polarity
sup_bowneg_vectorizer,sup_BOW_neg_matrix = Cause.construct_feature_matrix(sup_BOW_addpola)

#bag-of-words + pronoun
sup_pron_vectorizer,sup_BOW_pron_matrix = Cause.construct_feature_matrix(sup_BOW_addPron_tweet)

#bag-of-words + keywords' context
sup_cont_vectorizer,sup_BOW_cont_matrix = Cause.construct_feature_matrix(sup_BOW_addCont_tweet)

#bag-of-words + remove causekeywords
sup_rmeco_vectorizer,sup_BOW_rmTopic_matrix = Cause.construct_feature_matrix(sup_BOW_rmTopic_tweet)

#bag-of-words + self_mention
sup_BOW_self_tweet_once = Cause.selfmention(sup_healthbrand_list,healthbrand_nameid,sup_healthtweet_list, count="once")
sup_mention1_vectorizer,sup_BOW_mentionOnce_matrix = Cause.construct_feature_matrix(sup_BOW_self_tweet_once)
#pickle.dump(mention1_vectorizer, open(PATH+"Tweet_Health_self.vectorizer", 'wb'))

#bag-of-words + self_mention to all words
sup_BOW_self_tweet_all = Cause.selfmention(sup_healthbrand_list,healthbrand_nameid,sup_healthtweet_list, count="all")
sup_mentionAll_vectorizer,sup_BOW_mentionAll_matrix = Cause.construct_feature_matrix(sup_BOW_self_tweet_all)

#bag-of-words + self_mention + pronoun
sup_BOW_self_pron_tweet = Cause.selfmention(sup_healthbrand_list,healthbrand_nameid,sup_BOW_addPron_tweet, count="once")
sup_mention1pron_vectorizer,sup_BOW_mentionOncePron_matrix = Cause.construct_feature_matrix(sup_BOW_self_pron_tweet)

#bag-of-words + self_mention to all words + pronoun
sup_BOW_selfall_pron_tweet = Cause.selfmention(sup_healthbrand_list,healthbrand_nameid,sup_BOW_addPron_tweet, count="all")
sup_mentionAllpron_vectorizer,sup_BOW_mentionAllPron_matrix = Cause.construct_feature_matrix(sup_BOW_selfall_pron_tweet)

In [114]:
sup_Lingu_feature_names = ["sup_BOW_tw_matrix","sup_BOW_neg_matrix","sup_BOW_pron_matrix","sup_BOW_cont_matrix", 
                           "sup_BOW_rmTopic_matrix","sup_BOW_mentionOnce_matrix","sup_BOW_mentionAll_matrix", 
                           "sup_BOW_mentionOncePron_matrix", "sup_BOW_mentionAllPron_matrix"]

sup_Lingu_features = [sup_BOW_tw_matrix, sup_BOW_neg_matrix, sup_BOW_pron_matrix, sup_BOW_cont_matrix, sup_BOW_rmTopic_matrix,
        sup_BOW_mentionOnce_matrix, sup_BOW_mentionAll_matrix, sup_BOW_mentionOncePron_matrix, sup_BOW_mentionAllPron_matrix]
sup_Lingu_feature_dict = {}
for i in range(len(sup_Lingu_feature_names)):
    sup_Lingu_feature_dict[sup_Lingu_feature_names[i]] = sup_Lingu_features[i]

In [116]:
Cause.eva_bow_feature(sup_Lingu_feature_dict,sup_healthlabel_list,sup_lr,mycv,score_func='f1')            

sup_BOW_tw_matrix	0.909
sup_BOW_mentionOnce_matrix	0.908
sup_BOW_neg_matrix	0.907
sup_BOW_mentionOncePron_matrix	0.907
sup_BOW_cont_matrix	0.906
sup_BOW_rmTopic_matrix	0.906
sup_BOW_pron_matrix	0.906
sup_BOW_mentionAll_matrix	0.889
sup_BOW_mentionAllPron_matrix	0.886


In [68]:
sup_W2V_feature_names = ["sup_W2V_tw_score", "sup_W2V_tw_vector", "sup_W2V_topnwd_matrix", "sup_W2V_topnwd_scores", 
                     "sup_W2V_topnwd_vectors", "sup_Topicwd_matrix", "sup_W2V_topicwd_ct", "sup_W2V_topicwd_sum", 
                     "sup_W2V_contri_score", "sup_W2V_topicwd_vectors"]

sup_W2V_features = [sup_W2V_tw_score, sup_W2V_tw_vector, sup_W2V_topnwd_matrix, sup_W2V_topnwd_scores, sup_W2V_topnwd_vectors, 
                sup_Topicwd_matrix, sup_W2V_topicwd_ct, sup_W2V_topicwd_sum, sup_W2V_contri_score, sup_W2V_topicwd_vectors]
sup_W2V_feature_dict = {}
for i in range(len(sup_W2V_feature_names)):
    sup_W2V_feature_dict[sup_W2V_feature_names[i]] = sup_W2V_features[i]

In [69]:
Cause.eva_w2v_feature(sup_W2V_feature_dict,sup_ecolabel_list,sup_lr,mycv,score_func='f1') 

sup_W2V_topicwd_vectors	0.930
sup_W2V_topnwd_vectors	0.928
sup_W2V_tw_vector	0.927
sup_Topicwd_matrix	0.920
sup_W2V_topnwd_matrix	0.915
sup_W2V_topnwd_scores	0.886
sup_W2V_topicwd_sum	0.876
sup_W2V_contri_score	0.874
sup_W2V_tw_score	0.869
sup_W2V_topicwd_ct	0.864


In [70]:
sup_COM_feature_names = ["sup_BOW_tw_matrix","sup_BOW_mentionOnce_matrix","sup_BOW_mentionOncePron_matrix","sup_BOW_cont_matrix",
                     "sup_W2V_tw_vector", "sup_W2V_topnwd_matrix", "sup_W2V_topnwd_vectors", "sup_W2V_topicwd_vectors"]

sup_COM_features = [sup_BOW_tw_matrix,sup_BOW_mentionOnce_matrix,sup_BOW_mentionOncePron_matrix,sup_BOW_cont_matrix,
                sup_W2V_tw_vector, sup_W2V_topnwd_matrix, sup_W2V_topnwd_vectors, sup_W2V_topicwd_vectors]
sup_COM_feature_dict = {}
for i in range(len(sup_COM_feature_names)):
    sup_COM_feature_dict[sup_COM_feature_names[i]] = sup_COM_features[i]

In [71]:
Cause.eva_comb_feature(sup_COM_feature_names,sup_COM_feature_dict,sup_ecolabel_list,sup_lr,mycv,score_func='f1')   

Combine 1 features.
Combine 2 features.
Combine 3 features.
Combine 4 features.
Combine 5 features.
sup_BOW_tw_matrix + sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.936
sup_BOW_mentionOnce_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.935
sup_BOW_mentionAllPron_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.935
sup_W2V_tw_vector + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.935
sup_BOW_tw_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.935
sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.935
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.935
sup_BOW_mentionOnce_matrix + sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.934
sup_BOW_mentionAllPron_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.933
sup_BOW_tw_matrix + sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_matrix + sup_W2V_topnwd_vectors	0.933


In [42]:
com_Best_feature_names=["sup_BOW_tw_matrix","sup_BOW_cont_matrix","sup_W2V_tw_vector", "sup_W2V_topnwd_vectors"]
com_Best_feature_list = [sup_BOW_tw_matrix,sup_BOW_cont_matrix,sup_W2V_tw_vector, sup_W2V_topnwd_vectors]
com_Best_feature_dict = {}
for i in range(len(com_Best_feature_names)):
    com_Best_feature_dict[com_Best_feature_names[i]] = com_Best_feature_names[i]
   
com_Best_feature = com_Best_feature_list[0]
for feature in com_Best_feature_list[1:]:
    com_Best_feature = np.hstack((com_Best_feature,feature))

print("precision:%.3f" % np.mean(cross_val_score(sup_lr, com_Best_feature, sup_healthlabel_list,cv=mycv,scoring='precision')))
print("recall:%.3f" % np.mean(cross_val_score(sup_lr, com_Best_feature, sup_healthlabel_list,cv=mycv,scoring='recall')))
print("f1:%.3f" % np.mean(cross_val_score(sup_lr, com_Best_feature, sup_healthlabel_list,cv=mycv,scoring='f1')))

precision:0.918
recall:0.954
f1:0.936


**[brands-health]** section 3.2: Evaluate different classifiers

In [75]:
lr = LogisticRegression(multi_class ='ovr',penalty='l2',class_weight="balanced")
gnb = GaussianNB()
rf = RandomForestClassifier()
nn = MLPClassifier(solver='lbfgs',alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)

In [76]:
Cause.eva_classifier(com_Best_feature,sup_healthlabel_list,mycv,score_func='f1',classifier_list = [lr,gnb,rf,nn])

0.923	LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
0.811	GaussianNB(priors=None)
0.893	RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
0.923	MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(15,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5

**[brands-health]** section 3.3 Analyze terms that have high coefficients

In [45]:
sup_BOW_lr = LogisticRegression(penalty="l2",class_weight="balanced")
sup_BOW_lr.fit(sup_BOW_mentionOnce_matrix, sup_healthlabel_list)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [46]:
print("Top 20 positive coefficient words:")
for i in np.argsort(sup_BOW_lr.coef_[0])[::-1][:20]:
    print('%20s\t%.3f' % (sup_mention1_vectorizer.get_feature_names()[i], sup_BOW_lr.coef_[0][i]))

Top 20 positive coefficient words:
             healthy	3.136
          nutritious	2.664
             organic	1.817
    _HASHTAG_healthy	1.785
    _HASHTAG_organic	1.606
      _HASHTAG_vegan	1.594
           healthier	1.588
             natural	1.400
           wholesome	1.276
                free	0.933
               foods	0.848
               vegan	0.834
            calories	0.831
              fruits	0.786
                help	0.744
               great	0.732
          vegetarian	0.725
                know	0.682
             veggies	0.664
                skin	0.653


In [47]:
print("Top 20 negative coefficient words:")
for i in np.argsort(sup_BOW_lr.coef_[0])[::1][:20]:
    print('%20s\t%.3f' % (sup_mention1_vectorizer.get_feature_names()[i], sup_BOW_lr.coef_[0][i]))

Top 20 negative coefficient words:
               water	-1.198
   _HASHTAG_cleanser	-1.033
              flavor	-0.883
                  hi	-0.818
           chocolate	-0.732
             animals	-0.725
              fruity	-0.717
                 pop	-0.656
             looking	-0.617
             contain	-0.609
 _MENTION_doratahair	-0.601
               sleek	-0.601
  _HASHTAG_sleekchic	-0.601
                time	-0.582
              cereal	-0.579
              cheese	-0.572
               enjoy	-0.572
               isn't	-0.556
          ingredient	-0.553
                life	-0.538


**[brands-health]** Section 4: train and evaluate commitment classifier with manually labeled tweets

In [48]:
#Use logistic regression as a basic classifier.
comt_lr = LogisticRegression(solver = 'lbfgs',multi_class ='ovr')

**[brands-health]** section 4.1: Evaluating linguistic features, word embedding features and combination of various features

In [49]:
#bag-of-words
comt_bow_vectorizer,comt_BOW_tw_matrix = Cause.construct_feature_matrix(comt_BOW_tweet)

#bag-of-words + polarity
comt_bowneg_vectorizer,comt_BOW_neg_matrix = Cause.construct_feature_matrix(comt_BOW_addpola)

#bag-of-words + pronoun
comt_pron_vectorizer,comt_BOW_pron_matrix = Cause.construct_feature_matrix(comt_BOW_addPron_tweet)

#bag-of-words + keywords' context
comt_cont_vectorizer,comt_BOW_cont_matrix = Cause.construct_feature_matrix(comt_BOW_addCont_tweet)

#bag-of-words + remove causekeywords
comt_rmeco_vectorizer,comt_BOW_rmTopic_matrix = Cause.construct_feature_matrix(comt_BOW_rmTopic_tweet)

#bag-of-words + self_mention
comt_BOW_self_tweet_once = Cause.selfmention(comt_healthbrand_list,healthbrand_nameid,comt_healthtweet_list, count="once")
comt_mention1_vectorizer,comt_BOW_mentionOnce_matrix = Cause.construct_feature_matrix(comt_BOW_self_tweet_once)
#pickle.dump(mention1_vectorizer, open(PATH+"Tweet_Health_self.vectorizer", 'wb'))

#bag-of-words + self_mention to all words
comt_BOW_self_tweet_all = Cause.selfmention(comt_healthbrand_list,healthbrand_nameid,comt_healthtweet_list, count="all")
comt_mentionAll_vectorizer,comt_BOW_mentionAll_matrix = Cause.construct_feature_matrix(comt_BOW_self_tweet_all)

#bag-of-words + self_mention + pronoun
comt_BOW_self_pron_tweet = Cause.selfmention(comt_healthbrand_list,healthbrand_nameid,comt_BOW_addPron_tweet, count="once")
comt_mention1pron_vectorizer,comt_BOW_mentionOncePron_matrix = Cause.construct_feature_matrix(comt_BOW_self_pron_tweet)

#bag-of-words + self_mention to all words + pronoun
comt_BOW_selfall_pron_tweet = Cause.selfmention(comt_healthbrand_list,healthbrand_nameid,comt_BOW_addPron_tweet, count="all")
comt_mentionAllpron_vectorizer,comt_BOW_mentionAllPron_matrix = Cause.construct_feature_matrix(comt_BOW_selfall_pron_tweet)

In [79]:
comt_Lingu_feature_names = ["comt_BOW_tw_matrix","comt_BOW_neg_matrix","comt_BOW_pron_matrix","comt_BOW_cont_matrix", 
                           "comt_BOW_rmTopic_matrix","comt_BOW_mentionOnce_matrix","comt_BOW_mentionAll_matrix", 
                           "comt_BOW_mentionOncePron_matrix", "comt_BOW_mentionAllPron_matrix"]

comt_Lingu_features = [comt_BOW_tw_matrix, comt_BOW_neg_matrix, comt_BOW_pron_matrix, comt_BOW_cont_matrix, comt_BOW_rmTopic_matrix,
        comt_BOW_mentionOnce_matrix, comt_BOW_mentionAll_matrix, comt_BOW_mentionOncePron_matrix, comt_BOW_mentionAllPron_matrix]
comt_Lingu_feature_dict = {}
for i in range(len(comt_Lingu_feature_names)):
    comt_Lingu_feature_dict[comt_Lingu_feature_names[i]] = comt_Lingu_features[i]

In [80]:
Cause.eva_bow_feature(comt_Lingu_feature_dict,comt_healthlabel_list,comt_lr,mycv,score_func='f1')            

comt_BOW_mentionAllPron_matrix	0.742
comt_BOW_mentionAll_matrix	0.732
comt_BOW_mentionOncePron_matrix	0.723
comt_BOW_mentionOnce_matrix	0.712
comt_BOW_neg_matrix	0.683
comt_BOW_tw_matrix	0.680
comt_BOW_rmTopic_matrix	0.679
comt_BOW_cont_matrix	0.679
comt_BOW_pron_matrix	0.675


In [87]:
comt_W2V_feature_names = ["comt_W2V_tw_vector", "comt_W2V_topnwd_matrix", "comt_W2V_topnwd_scores", 
                     "comt_W2V_topnwd_vectors", "comt_Topicwd_matrix", "comt_W2V_topicwd_ct", 
                     "comt_W2V_contri_score", "comt_W2V_topicwd_vectors"]

comt_W2V_features = [comt_W2V_tw_vector, comt_W2V_topnwd_matrix, comt_W2V_topnwd_scores, comt_W2V_topnwd_vectors, 
                comt_Topicwd_matrix, comt_W2V_topicwd_ct, comt_W2V_contri_score, comt_W2V_topicwd_vectors]
comt_W2V_feature_dict = {}
for i in range(len(comt_W2V_feature_names)):
    comt_W2V_feature_dict[comt_W2V_feature_names[i]] = comt_W2V_features[i]

In [88]:
Cause.eva_w2v_feature(comt_W2V_feature_dict,comt_healthlabel_list,comt_lr,mycv,score_func='f1') 

comt_W2V_tw_vector	0.697
comt_W2V_topicwd_vectors	0.677
comt_Topicwd_matrix	0.671
comt_W2V_topnwd_vectors	0.656
comt_W2V_topnwd_matrix	0.640
comt_W2V_topnwd_scores	0.416
comt_W2V_topicwd_ct	0.399
comt_W2V_contri_score	0.322


In [101]:
comt_COM_feature_names = ["comt_BOW_tw_matrix","comt_BOW_mentionOnce_matrix","comt_BOW_mentionAll_matrix","comt_BOW_mentionAllPron_matrix","comt_BOW_cont_matrix",
                     "comt_W2V_tw_vector", "comt_W2V_topnwd_matrix", "comt_W2V_topnwd_vectors", "comt_W2V_topicwd_vectors"]

comt_COM_features = [comt_BOW_tw_matrix,comt_BOW_mentionOnce_matrix,comt_BOW_mentionAll_matrix,comt_BOW_mentionAllPron_matrix,comt_BOW_cont_matrix,
                comt_W2V_tw_vector, comt_W2V_topnwd_matrix, comt_W2V_topnwd_vectors, comt_W2V_topicwd_vectors]
comt_COM_feature_dict = {}
for i in range(len(comt_COM_feature_names)):
    comt_COM_feature_dict[comt_COM_feature_names[i]] = comt_COM_features[i]

In [102]:
Cause.eva_comb_feature(comt_COM_feature_names,comt_COM_feature_dict,comt_healthlabel_list,comt_lr,mycv,score_func='f1')   

Combine 1 features.
Combine 2 features.
Combine 3 features.
Combine 4 features.
Combine 5 features.
comt_BOW_mentionAll_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_cont_matrix + comt_W2V_topnwd_vectors + comt_W2V_topicwd_vectors	0.750
comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_topnwd_matrix + comt_W2V_topnwd_vectors + comt_W2V_topicwd_vectors	0.750
comt_BOW_mentionAll_matrix + comt_BOW_mentionAllPron_matrix	0.749
comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_matrix + comt_W2V_topicwd_vectors	0.749
comt_BOW_tw_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_matrix + comt_W2V_topicwd_vectors	0.749
comt_BOW_mentionAll_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector	0.749
comt_BOW_mentionAll_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_cont_matrix + comt_W2V_topnwd_vectors	0.747
comt_BOW_tw_matrix + comt_BOW_mentionAll_matrix + comt_BOW_mentionAllPron_

In [50]:
com_Best_feature_names2=["comt_BOW_mentionOnce_matrix", "comt_BOW_mentionAllPron_matrix","comt_W2V_topnwd_matrix",
                         "comt_W2V_topnwd_vectors","comt_W2V_topicwd_vectors"]
com_Best_feature_list2 = [comt_BOW_mentionOnce_matrix, comt_BOW_mentionAllPron_matrix,comt_W2V_topnwd_matrix,
                         comt_W2V_topnwd_vectors,comt_W2V_topicwd_vectors]
com_Best_feature_dict2 = {}
for i in range(len(com_Best_feature_names2)):
    com_Best_feature_dict2[com_Best_feature_names2[i]] = com_Best_feature_names2[i]
   
comt_com_Best_feature2 = com_Best_feature_list2[0]
for feature in com_Best_feature_list2[1:]:
    comt_com_Best_feature2 = np.hstack((comt_com_Best_feature2,feature))

print("precision:%.3f" % np.mean(cross_val_score(comt_lr, comt_com_Best_feature2, comt_healthlabel_list,cv=mycv,scoring='precision')))
print("recall:%.3f" % np.mean(cross_val_score(comt_lr, comt_com_Best_feature2, comt_healthlabel_list,cv=mycv,scoring='recall')))
print("f1:%.3f" % np.mean(cross_val_score(comt_lr, comt_com_Best_feature2, comt_healthlabel_list,cv=mycv,scoring='f1')))

precision:0.760
recall:0.739
f1:0.747


**[brands-health]** section 4.2: Evaluate different classifiers

In [104]:
lr = LogisticRegression(multi_class ='ovr',penalty='l2',class_weight="balanced")
gnb = GaussianNB()
rf = RandomForestClassifier()
nn = MLPClassifier(solver='lbfgs',alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)

In [105]:
Cause.eva_classifier(comt_com_Best_feature2,comt_healthlabel_list,mycv,score_func='f1',classifier_list = [lr,gnb,rf,nn])

0.747	LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
0.649	GaussianNB(priors=None)
0.597	RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
0.731	MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(15,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5

**[brands-health]** section 4.3: Analyze terms that have high coefficients

In [51]:
comt_BOW_lr = LogisticRegression(penalty="l2",class_weight="balanced")
comt_BOW_lr.fit(comt_BOW_mentionAll_matrix, comt_healthlabel_list)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [52]:
print("Top 20 positive coefficient words:")
for i in np.argsort(comt_BOW_lr.coef_[0])[::-1][:20]:
    print('%20s\t%.3f' % (comt_mentionAll_vectorizer.get_feature_names()[i], comt_BOW_lr.coef_[0][i]))

Top 20 positive coefficient words:
            _self_rt	1.331
             flavors	1.045
         ingredients	1.029
          _self_with	1.006
             natural	0.988
     _HASHTAG_nongmo	0.981
               vegan	0.924
              dishes	0.863
                 new	0.847
       _self_organic	0.829
               bread	0.799
                 add	0.788
                just	0.765
           certified	0.745
                 tea	0.726
                 gmo	0.696
               sweet	0.651
                   2	0.647
_self__HASHTAG_vegan	0.632
       _self_natural	0.609


In [53]:
print("Top 20 negative coefficient words:")
for i in np.argsort(comt_BOW_lr.coef_[0])[::1][:20]:
    print('%20s\t%.3f' % (comt_mentionAll_vectorizer.get_feature_names()[i], comt_BOW_lr.coef_[0][i]))

Top 20 negative coefficient words:
                  rt	-1.363
              eating	-1.064
    _HASHTAG_healthy	-1.019
               foods	-1.005
                 eat	-0.875
                hair	-0.865
                best	-0.834
               right	-0.797
           important	-0.712
               think	-0.704
                 raw	-0.637
              recipe	-0.633
         _self_fruit	-0.632
              simple	-0.624
             veggies	-0.610
       _self_healthy	-0.610
                  vs	-0.590
                diet	-0.583
              sounds	-0.559
                 i'm	-0.556


**[brands-health]** section 5: Apply pre-trained classifiers to predict for unseen tweets

**[brands-health]** section 5.1: Apply support classifier to classify all brands' tweets into support and non-support classes

In [56]:
sup_lr.fit(com_Best_feature,sup_healthlabel_list)
Cause.healthbrand_predict_label_0_1(sup_lr,healthbrand_nameid,eco_terms,GN_model,health_keywords,sup_bow_vectorizer,sup_cont_vectorizer,
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_score_test.txt",
                             "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_23_test.txt")

Note: this code do prediction for each brand, it takes about 8~10 hours to run for each dataset
processing 1 brand: oldelpaso
processing 2 brand: roman_meal
processing 3 brand: seagate
finished!


**[brands-health]** section 5.2: Apply commitment classifier to classify all brands' support tweets into high- and low- commitment classes

In [59]:
comt_lr.fit(comt_com_Best_feature2,comt_healthlabel_list)
Cause.healthbrand_predict_label_2_3(comt_lr,healthbrand_nameid,eco_terms,comt_mention1_vectorizer,comt_mentionAllpron_vectorizer,comt_topn_vectorizer,
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_23_test.txt",
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_2_3_test.txt")         

TypeError: healthbrand_predict_label_2_3() takes 7 positional arguments but 8 were given

**[brands-health]** section 6: aggregate each entity's cause-commitment tweets and compare with action score to find inauthentic entities

In [60]:
entity_pred_info, entity_pred2_tw, entity_pred3_tw = Cause.get_aggregate_info("/data/2/zwang/2017_S/Tweet_Health/142brand_tweet_predict_proba_01_2_3.txt",
                                                         sim_limit=0.3,prob_limit=0.7)


Get data for 140 entities


In [62]:
remain_entity_predicts, remove_entity = Cause.filt_entity(entity_pred_info, healthbrand_score_dict, ntw_threshold=0)

130 entities remain after remove entities (number of tweets<0)


**[brands-health]** section 6.1: Apply different aggregation methods to select entities that have high word-ratings

In [63]:
entity_n3,entity_frac3,entity_prob3,words_topn_entities = Cause.aggregation(remain_entity_predicts,topn=120)

**[brands-health]** section 6.2: Sort high word-rating entities by their action-rating and select top-n (high word-rating but low action-rating) as inauthentic entities

In [64]:
inauthentic_entities = Cause.inauthentic(words_topn_entities,healthbrand_score_dict,n=10)

In [65]:
print("entity\taction_score\tn_label3\tfrac_label3\tprob_label3\n")
for entity in inauthentic_entities:
    print("%s\t%f\t%d\t%f\t%f" % (entity,float(healthbrand_score_dict[entity]),int(entity_n3[entity]),float(entity_frac3[entity]),
                                  float(entity_prob3[entity])))

entity	action_score	n_label3	frac_label3	prob_label3

coffee_mate	1.400000	9	1.000000	0.930667
ampenergy	1.400000	3	0.428571	0.847667
monsterenergysa	1.400000	1	1.000000	0.746000
sprite	1.500000	3	1.000000	0.855667
littledebbie	1.500000	5	0.312500	0.876200
powerade	1.500000	2	1.000000	0.826500
cocacola	1.500000	3	0.750000	0.800333
gatorade	1.500000	18	0.720000	0.786944
haagendazs_us	1.800000	6	1.000000	0.887000
hellmanns	1.900000	2	0.400000	0.926500


#### Dataset 2: brands with eco cause

**[brands-eco]** section 1: select cause-relevant tweets as training data

In [7]:
ecobrand_score_dict, ecobrand_sector,ecobrand_nameid = Cause.get_brand_info(ECO_SCORE,"twitter","TGS")

Get 1017 brands with TGS


In [8]:
print("Sector distribution of eco brands:")
eco_sector = Counter()
eco_sector.update(ecobrand_sector.values())
eco_sector

Sector distribution of eco brands:


Counter({'Apparel': 108,
         'Appliances': 23,
         'Car': 43,
         'Electronics': 61,
         'Food': 410,
         'Household Che': 29,
         'Lighting Prod': 3,
         'Paper Product': 25,
         'Personal Care': 298,
         'Pet Food': 17})

In [9]:
ecobrand_tweets_dict = Cause.read_brand_tweets(BRAND_PATH+"tweets.pruned.json.gz", list(ecobrand_score_dict.keys()),
                                            cause="eco")

read 500000 lines
read 1000000 lines
read 1500000 lines
read 2000000 lines
read 2500000 lines
Collected 2624800 tweets for 966 eco brands in total.


In [10]:
ecobrand_twID_dict, ecobrand_twID_twtext = Cause.dedup_tweets(ecobrand_tweets_dict,cause="eco")

processed 100 entities
processed 200 entities
processed 300 entities
processed 400 entities
processed 500 entities
processed 600 entities
processed 700 entities
processed 800 entities
processed 900 entities
2280489 non-duplicate tweets for 966 eco brands


In [14]:
ecobrand_twID_twScore = Cause.score_tweet_by_relevance(ecobrand_twID_twtext,GN_model,eco_keywords)

NameError: name 'ecobrand_twID_twtext' is not defined

In [None]:
healthbrand_twIDScore = Cause.sort_tweet_by_score(healthbrand_twID_twScore,healthbrand_twID_dict)
#Cause.select_topn_tweets(filename,ecobrand_twIDScore,ecobrand_twID_twtext,topn=1)

**[brands-eco]** section 2 & 3: feature enginerring & train and evaluate support classifier

> Labeled data for support classification

In [11]:
sup_ecobrand_list, sup_ecotweet_list, sup_ecolabel_list = Cause.data_for_sup_clf(brand_eco_path, entity='eco-brand')

Read 308 positive instances and 658 negative instances for support classification


In [12]:
sup_eco_neg_terms, sup_eco_pos_terms = Cause.get_freq_terms(sup_ecotweet_list,sup_ecolabel_list)

In [13]:
print("Most common terms in negative class (non-support):")
sup_eco_neg_terms.most_common(20)

Most common terms in negative class (non-support):


[('_URL_', 403),
 ('natural', 73),
 ('rt', 66),
 ('_NUMBER_', 45),
 ('amp', 39),
 ('us', 34),
 ('new', 32),
 ('habitat', 32),
 ("it's", 31),
 ('water', 28),
 ('food', 26),
 ('beautiful', 26),
 ('skin', 25),
 ('know', 25),
 ('beauty', 25),
 ('life', 25),
 ('great', 23),
 ('like', 23),
 ('forest', 23),
 ('nature', 22)]

In [14]:
print("Most common terms in positive class (support):")
sup_eco_pos_terms.most_common(20)

Most common terms in positive class (support):


[('_URL_', 218),
 ('environment', 48),
 ('_NUMBER_', 43),
 ('sustainable', 41),
 ('rt', 34),
 ('help', 30),
 ('amp', 29),
 ('learn', 26),
 ('planet', 25),
 ('environmental', 22),
 ('sustainability', 20),
 ('water', 20),
 ('earth', 19),
 ('climate', 18),
 ('support', 18),
 ('day', 17),
 ('protect', 16),
 ("we're", 16),
 ('energy', 15),
 ('committed', 15)]

**[brands-eco]** section 3.1: Evaluating linguistic features, word embedding features and combination of various features

In [None]:
sup_lr = LogisticRegression(solver = 'lbfgs',multi_class ='ovr')

In [16]:
#bag-of-words
sup_BOW_tweet = sup_ecotweet_list
sup_bow_vectorizer,sup_BOW_tw_matrix = Cause.construct_feature_matrix(sup_BOW_tweet)

#bag-of-words + polarity
sup_BOW_addpola = Cause.mark_polarity(sup_ecotweet_list,to_wd=0)
sup_bowneg_vectorizer,sup_BOW_neg_matrix = Cause.construct_feature_matrix(sup_BOW_addpola)

#bag-of-words + pronoun
sup_BOW_addPron_tweet = Cause.mark_pronouns(sup_ecotweet_list, binary = False)
sup_pron_vectorizer,sup_BOW_pron_matrix = Cause.construct_feature_matrix(sup_BOW_addPron_tweet)

#bag-of-words + keywords' context
sup_BOW_addCont_tweet = Cause.mark_context(sup_ecotweet_list,eco_terms)
sup_cont_vectorizer,sup_BOW_cont_matrix = Cause.construct_feature_matrix(sup_BOW_addCont_tweet)

#bag-of-words + remove cause keywords
sup_BOW_rmTopic_tweet = Cause.remove_keywords(sup_ecotweet_list, eco_terms)
sup_rmTopic_vectorizer,sup_BOW_rmTopic_matrix = Cause.construct_feature_matrix(sup_BOW_rmTopic_tweet)

#bag-of-words + self_mention
sup_BOW_self_tweet_once = Cause.selfmention(sup_ecobrand_list,ecobrand_nameid,sup_ecotweet_list, count="once")
sup_mention1_vectorizer,sup_BOW_mentionOnce_matrix = Cause.construct_feature_matrix(sup_BOW_self_tweet_once)

#bag-of-words + self_mention to all words
sup_BOW_self_tweet_all = Cause.selfmention(sup_ecobrand_list,ecobrand_nameid,sup_ecotweet_list, count="all")
sup_mentionAll_vectorizer,sup_BOW_mentionAll_matrix = Cause.construct_feature_matrix(sup_BOW_self_tweet_all)

#bag-of-words + self_mention + pronoun
sup_BOW_self_pron_tweet = Cause.selfmention(sup_ecobrand_list,ecobrand_nameid,sup_BOW_addPron_tweet, count="once")
sup_mention1pron_vectorizer,sup_BOW_mentionOncePron_matrix = Cause.construct_feature_matrix(sup_BOW_self_pron_tweet)

#bag-of-words + self_mention to all words + pronoun
sup_BOW_selfall_pron_tweet = Cause.selfmention(sup_ecobrand_list,ecobrand_nameid,sup_BOW_addPron_tweet, count="all")
sup_mentionAllpron_vectorizer,sup_BOW_mentionAllPron_matrix = Cause.construct_feature_matrix(sup_BOW_selfall_pron_tweet)

In [118]:
sup_Lingu_feature_names = ["sup_BOW_tw_matrix","sup_BOW_neg_matrix","sup_BOW_pron_matrix","sup_BOW_cont_matrix", 
                           "sup_BOW_rmTopic_matrix","sup_BOW_mentionOnce_matrix","sup_BOW_mentionAll_matrix", 
                           "sup_BOW_mentionOncePron_matrix", "sup_BOW_mentionAllPron_matrix"]

sup_Lingu_features = [sup_BOW_tw_matrix, sup_BOW_neg_matrix, sup_BOW_pron_matrix, sup_BOW_cont_matrix, sup_BOW_rmTopic_matrix,
        sup_BOW_mentionOnce_matrix, sup_BOW_mentionAll_matrix, sup_BOW_mentionOncePron_matrix, sup_BOW_mentionAllPron_matrix]
sup_Lingu_feature_dict = {}
for i in range(len(sup_Lingu_feature_names)):
    sup_Lingu_feature_dict[sup_Lingu_feature_names[i]] = sup_Lingu_features[i]

In [119]:
Cause.eva_bow_feature(sup_Lingu_feature_dict,sup_ecolabel_list,sup_lr,mycv,score_func='f1')            

sup_BOW_neg_matrix	0.716
sup_BOW_cont_matrix	0.716
sup_BOW_tw_matrix	0.713
sup_BOW_pron_matrix	0.711
sup_BOW_mentionOnce_matrix	0.709
sup_BOW_mentionOncePron_matrix	0.709
sup_BOW_mentionAllPron_matrix	0.675
sup_BOW_mentionAll_matrix	0.670
sup_BOW_rmTopic_matrix	0.651


In [17]:
sup_W2V_tw_score = Cause.calculate_tw_w2v_score(sup_ecotweet_list,GN_model,eco_keywords)
sup_W2V_tw_vector = Cause.construct_tw_vector(sup_ecotweet_list,GN_model)

sup_tweet_rankedwd_list = Cause.rank_match_words(sup_ecotweet_list, GN_model, eco_keywords)
sup_W2V_topnwd, sup_W2V_topnwd_scores = Cause.get_topn_words(sup_tweet_rankedwd_list,n=3)
sup_topn_vectorizer,sup_W2V_topnwd_matrix = Cause.construct_feature_matrix(sup_W2V_topnwd)

sup_W2V_topnwd_vectors = Cause.get_topn_vectors(sup_tweet_rankedwd_list,GN_model,n=3)

sup_W2V_topicwd_list,sup_tweet_topicwd_tp_list = Cause.get_topic_words(sup_ecotweet_list,GN_model,eco_keywords,threshold = 0.30)
sup_topic_vectorizer,sup_Topicwd_matrix = Cause.construct_feature_matrix(sup_W2V_topicwd_list)

sup_W2V_topicwd_ct,sup_W2V_topicwd_score,sup_W2V_topicwd_leftcontri,sup_W2V_topicwd_rightcontri = Cause.sep_topic_features(sup_tweet_topicwd_tp_list)

sup_W2V_topicwd_sum = Cause.get_topicwd_score_sum(sup_W2V_topicwd_score)

sup_W2V_contri_score = Cause.get_contri_sum(sup_tweet_topicwd_tp_list)

sup_W2V_topicwd_vectors = Cause.get_topicwd_vec(GN_model,sup_tweet_topicwd_tp_list)

In [121]:
sup_W2V_feature_names = ["sup_W2V_tw_score", "sup_W2V_tw_vector", "sup_W2V_topnwd_matrix", "sup_W2V_topnwd_scores", 
                     "sup_W2V_topnwd_vectors", "sup_Topicwd_matrix", "sup_W2V_topicwd_ct", "sup_W2V_topicwd_sum", 
                     "sup_W2V_contri_score", "sup_W2V_topicwd_vectors"]

sup_W2V_features = [sup_W2V_tw_score, sup_W2V_tw_vector, sup_W2V_topnwd_matrix, sup_W2V_topnwd_scores, sup_W2V_topnwd_vectors, 
                sup_Topicwd_matrix, sup_W2V_topicwd_ct, sup_W2V_topicwd_sum, sup_W2V_contri_score, sup_W2V_topicwd_vectors]
sup_W2V_feature_dict = {}
for i in range(len(sup_W2V_feature_names)):
    sup_W2V_feature_dict[sup_W2V_feature_names[i]] = sup_W2V_features[i]

In [122]:
Cause.eva_w2v_feature(sup_W2V_feature_dict,sup_ecolabel_list,sup_lr,mycv,score_func='f1') 

sup_W2V_topnwd_vectors	0.775
sup_W2V_tw_vector	0.742
sup_W2V_topnwd_matrix	0.684
sup_Topicwd_matrix	0.658
sup_W2V_topicwd_vectors	0.647
sup_W2V_topicwd_ct	0.433
sup_W2V_topicwd_sum	0.428
sup_W2V_contri_score	0.406
sup_W2V_topnwd_scores	0.393
sup_W2V_tw_score	0.254


In [123]:
sup_COM_feature_names = ["sup_BOW_tw_matrix","sup_BOW_neg_matrix","sup_BOW_mentionOnce_matrix","sup_BOW_cont_matrix",
                     "sup_W2V_tw_vector", "sup_W2V_topnwd_matrix", "sup_W2V_topnwd_vectors", "sup_W2V_topicwd_vectors"]

sup_COM_features = [sup_BOW_tw_matrix,sup_BOW_neg_matrix,sup_BOW_mentionOnce_matrix,sup_BOW_cont_matrix,
                sup_W2V_tw_vector, sup_W2V_topnwd_matrix, sup_W2V_topnwd_vectors, sup_W2V_topicwd_vectors]
sup_COM_feature_dict = {}
for i in range(len(sup_COM_feature_names)):
    sup_COM_feature_dict[sup_COM_feature_names[i]] = sup_COM_features[i]

In [124]:
Cause.eva_comb_feature(sup_COM_feature_names,sup_COM_feature_dict,sup_ecolabel_list,sup_lr,mycv,score_func='f1')   

Combine 1 features.
Combine 2 features.
Combine 3 features.
Combine 4 features.
Combine 5 features.
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.822
sup_BOW_tw_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_matrix + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.818
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_matrix + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.818
sup_BOW_neg_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_matrix + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.817
sup_BOW_mentionOnce_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_matrix + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.817
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_matrix + sup_W2V_topnwd_vectors	0.815
sup_BOW_mentionOnce_matrix + sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.812
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.811
sup_BOW_mentionOnce_matrix + sup_W2V_tw_vector + 

In [18]:
sup_com_Best_feature_names=["sup_BOW_cont_matrix","sup_W2V_tw_vector", "sup_W2V_topnwd_vectors"]
sup_com_Best_feature_list = [sup_BOW_cont_matrix,sup_W2V_tw_vector, sup_W2V_topnwd_vectors]
sup_com_Best_feature_dict = {}
for i in range(len(sup_com_Best_feature_names)):
    sup_com_Best_feature_dict[sup_com_Best_feature_names[i]] = sup_com_Best_feature_names[i]
   
sup_com_Best_feature = sup_com_Best_feature_list[0]
for feature in sup_com_Best_feature_list[1:]:
    sup_com_Best_feature = np.hstack((sup_com_Best_feature,feature))

print("precision:%.3f" % np.mean(cross_val_score(sup_lr, sup_com_Best_feature, sup_ecolabel_list,cv=mycv,scoring='precision')))
print("recall:%.3f" % np.mean(cross_val_score(sup_lr, sup_com_Best_feature, sup_ecolabel_list,cv=mycv,scoring='recall')))
print("f1:%.3f" % np.mean(cross_val_score(sup_lr, sup_com_Best_feature, sup_ecolabel_list,cv=mycv,scoring='f1')))

precision:0.865
recall:0.783
f1:0.821


**[brands-eco]** section 2 & 4: feature enginerring & train and evaluate commitment classifier

> Labeled data for commitment classification

In [19]:
comt_ecobrand_list, comt_ecotweet_list, comt_ecolabel_list = Cause.data_for_commit_clf(brand_eco_path,entity='eco-brand')

Read 148 positive instances and 160 negative instances for commitment classification


In [20]:
comt_eco_neg_terms, comt_eco_pos_terms = Cause.get_freq_terms(comt_ecotweet_list,comt_ecolabel_list)

In [21]:
print("Most common terms in negative class (low-commitment):")
comt_eco_neg_terms.most_common(20)

Most common terms in negative class (low-commitment):


[('_URL_', 118),
 ('environment', 23),
 ('rt', 22),
 ('planet', 20),
 ('_NUMBER_', 16),
 ('sustainable', 15),
 ('help', 14),
 ('day', 14),
 ('climate', 13),
 ('earth', 13),
 ('carbon', 12),
 ('amp', 11),
 ('great', 10),
 ('water', 10),
 ('future', 10),
 ('learn', 10),
 ('_HASHTAG_earthday', 10),
 ('plants', 9),
 ('know', 9),
 ("it's", 9)]

In [22]:
print("Most common terms in negative class (high-commitment):")
comt_eco_neg_terms.most_common(20)

Most common terms in negative class (high-commitment):


[('_URL_', 118),
 ('environment', 23),
 ('rt', 22),
 ('planet', 20),
 ('_NUMBER_', 16),
 ('sustainable', 15),
 ('help', 14),
 ('day', 14),
 ('climate', 13),
 ('earth', 13),
 ('carbon', 12),
 ('amp', 11),
 ('great', 10),
 ('water', 10),
 ('future', 10),
 ('learn', 10),
 ('_HASHTAG_earthday', 10),
 ('plants', 9),
 ('know', 9),
 ("it's", 9)]

**[brands-eco]** section 4.1: Evaluating linguistic features, word embedding features and combination of various features

In [None]:
comt_lr = LogisticRegression(solver = 'lbfgs',multi_class ='ovr')

In [24]:
#bag-of-words
comt_BOW_tweet = comt_ecotweet_list
comt_bow_vectorizer,comt_BOW_tw_matrix = Cause.construct_feature_matrix(comt_BOW_tweet)

#bag-of-words + polarity
comt_BOW_addpola = Cause.mark_polarity(comt_ecotweet_list,to_wd=0)
comt_bowneg_vectorizer,comt_BOW_neg_matrix = Cause.construct_feature_matrix(comt_BOW_addpola)

#bag-of-words + pronoun
comt_BOW_addPron_tweet = Cause.mark_pronouns(comt_ecotweet_list, binary = False)
comt_pron_vectorizer,comt_BOW_pron_matrix = Cause.construct_feature_matrix(comt_BOW_addPron_tweet)

#bag-of-words + keywords' context
comt_BOW_addCont_tweet = Cause.mark_context(comt_ecotweet_list,eco_terms)
comt_cont_vectorizer,comt_BOW_cont_matrix = Cause.construct_feature_matrix(comt_BOW_addCont_tweet)

#bag-of-words + remove cause keywords
comt_BOW_rmTopic_tweet = Cause.remove_keywords(comt_ecotweet_list, eco_terms)
comt_rmTopic_vectorizer,comt_BOW_rmTopic_matrix = Cause.construct_feature_matrix(comt_BOW_rmTopic_tweet)

#bag-of-words + self_mention
comt_BOW_self_tweet_once = Cause.selfmention(comt_ecobrand_list,ecobrand_nameid,comt_ecotweet_list, count="once")
comt_mention1_vectorizer,comt_BOW_mentionOnce_matrix = Cause.construct_feature_matrix(comt_BOW_self_tweet_once)

#bag-of-words + self_mention to all words
comt_BOW_self_tweet_all = Cause.selfmention(comt_ecobrand_list,ecobrand_nameid,comt_ecotweet_list, count="all")
comt_mentionAll_vectorizer,comt_BOW_mentionAll_matrix = Cause.construct_feature_matrix(comt_BOW_self_tweet_all)

#bag-of-words + self_mention + pronoun
comt_BOW_self_pron_tweet = Cause.selfmention(comt_ecobrand_list,ecobrand_nameid,comt_BOW_addPron_tweet, count="once")
comt_mention1pron_vectorizer,comt_BOW_mentionOncePron_matrix = Cause.construct_feature_matrix(comt_BOW_self_pron_tweet)

#bag-of-words + self_mention to all words + pronoun
comt_BOW_selfall_pron_tweet = Cause.selfmention(comt_ecobrand_list,ecobrand_nameid,comt_BOW_addPron_tweet, count="all")
comt_mentionAllpron_vectorizer,comt_BOW_mentionAllPron_matrix = Cause.construct_feature_matrix(comt_BOW_selfall_pron_tweet)

In [23]:
comt_Lingu_feature_names = ["comt_BOW_tw_matrix","comt_BOW_neg_matrix","comt_BOW_pron_matrix","comt_BOW_cont_matrix", 
                           "comt_BOW_rmTopic_matrix","comt_BOW_mentionOnce_matrix","comt_BOW_mentionAll_matrix", 
                           "comt_BOW_mentionOncePron_matrix", "comt_BOW_mentionAllPron_matrix"]

comt_Lingu_features = [comt_BOW_tw_matrix, comt_BOW_neg_matrix, comt_BOW_pron_matrix, comt_BOW_cont_matrix, comt_BOW_rmTopic_matrix,
        comt_BOW_mentionOnce_matrix, comt_BOW_mentionAll_matrix, comt_BOW_mentionOncePron_matrix, comt_BOW_mentionAllPron_matrix]
comt_Lingu_feature_dict = {}
for i in range(len(comt_Lingu_feature_names)):
    comt_Lingu_feature_dict[comt_Lingu_feature_names[i]] = comt_Lingu_features[i]

In [128]:
Cause.eva_bow_feature(comt_Lingu_feature_dict,comt_ecolabel_list,sup_lr,mycv,score_func='f1')            

comt_BOW_mentionAllPron_matrix	0.706
comt_BOW_mentionOncePron_matrix	0.698
comt_BOW_mentionOnce_matrix	0.670
comt_BOW_rmTopic_matrix	0.656
comt_BOW_mentionAll_matrix	0.654
comt_BOW_pron_matrix	0.646
comt_BOW_cont_matrix	0.639
comt_BOW_neg_matrix	0.612
comt_BOW_tw_matrix	0.612


In [25]:
comt_W2V_tw_score = Cause.calculate_tw_w2v_score(comt_ecotweet_list,GN_model,eco_keywords)
comt_W2V_tw_vector = Cause.construct_tw_vector(comt_ecotweet_list,GN_model)

comt_tweet_rankedwd_list = Cause.rank_match_words(comt_ecotweet_list, GN_model, eco_keywords)
comt_W2V_topnwd, comt_W2V_topnwd_scores = Cause.get_topn_words(comt_tweet_rankedwd_list,n=3)
comt_topn_vectorizer,comt_W2V_topnwd_matrix = Cause.construct_feature_matrix(comt_W2V_topnwd)

comt_W2V_topnwd_vectors = Cause.get_topn_vectors(comt_tweet_rankedwd_list,GN_model,n=3)

comt_W2V_topicwd_list,comt_tweet_topicwd_tp_list = Cause.get_topic_words(comt_ecotweet_list,GN_model,eco_keywords,threshold = 0.30)
comt_topic_vectorizer,comt_Topicwd_matrix = Cause.construct_feature_matrix(comt_W2V_topicwd_list)

comt_W2V_topicwd_ct,comt_W2V_topicwd_score,comt_W2V_topicwd_leftcontri,comt_W2V_topicwd_rightcontri = Cause.sep_topic_features(comt_tweet_topicwd_tp_list)

comt_W2V_topicwd_sum = Cause.get_topicwd_score_sum(comt_W2V_topicwd_score)

comt_W2V_contri_score = Cause.get_contri_sum(comt_tweet_topicwd_tp_list)

comt_W2V_topicwd_vectors = Cause.get_topicwd_vec(GN_model,comt_tweet_topicwd_tp_list)

In [136]:
comt_W2V_feature_names = ["comt_W2V_tw_vector", "comt_W2V_topnwd_matrix",  
                     "comt_W2V_topnwd_vectors", "comt_Topicwd_matrix", "comt_W2V_topicwd_ct", "comt_W2V_topicwd_sum", 
                     "comt_W2V_contri_score", "comt_W2V_topicwd_vectors"]

comt_W2V_features = [comt_W2V_tw_vector, comt_W2V_topnwd_matrix, comt_W2V_topnwd_vectors, 
                comt_Topicwd_matrix, comt_W2V_topicwd_ct, comt_W2V_topicwd_sum, comt_W2V_contri_score, comt_W2V_topicwd_vectors]
comt_W2V_feature_dict = {}
for i in range(len(comt_W2V_feature_names)):
    comt_W2V_feature_dict[comt_W2V_feature_names[i]] = comt_W2V_features[i]

In [137]:
Cause.eva_w2v_feature(comt_W2V_feature_dict,comt_ecolabel_list,comt_lr,mycv,score_func='f1') 

comt_W2V_tw_vector	0.653
comt_W2V_topnwd_vectors	0.587
comt_W2V_topnwd_matrix	0.581
comt_Topicwd_matrix	0.556
comt_W2V_topicwd_vectors	0.543
comt_W2V_contri_score	0.530
comt_W2V_topicwd_ct	0.413
comt_W2V_topicwd_sum	0.400


In [138]:
comt_COM_feature_names = ["comt_BOW_tw_matrix","comt_BOW_mentionOnce_matrix","comt_BOW_mentionAllPron_matrix","comt_BOW_rmTopic_matrix",
                     "comt_W2V_tw_vector", "comt_W2V_topnwd_matrix", "comt_W2V_topnwd_vectors", "comt_W2V_topicwd_vectors"]

comt_COM_features = [comt_BOW_tw_matrix,comt_BOW_mentionOnce_matrix,comt_BOW_mentionAllPron_matrix,comt_BOW_rmTopic_matrix,
                comt_W2V_tw_vector, comt_W2V_topnwd_matrix, comt_W2V_topnwd_vectors, comt_W2V_topicwd_vectors]
comt_COM_feature_dict = {}
for i in range(len(comt_COM_feature_names)):
    comt_COM_feature_dict[comt_COM_feature_names[i]] = comt_COM_features[i]

In [139]:
Cause.eva_comb_feature(comt_COM_feature_names,comt_COM_feature_dict,comt_ecolabel_list,comt_lr,mycv,score_func='f1')   

Combine 1 features.
Combine 2 features.
Combine 3 features.
Combine 4 features.
Combine 5 features.
comt_BOW_tw_matrix + comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix + comt_W2V_tw_vector	0.721
comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix	0.718
comt_BOW_tw_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix	0.713
comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix	0.712
comt_BOW_tw_matrix + comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix	0.709
comt_BOW_tw_matrix + comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector	0.709
comt_BOW_tw_matrix + comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_matrix	0.709
comt_BOW_tw_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_matrix	0.708
comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix + comt_

In [26]:
comt_com_Best_feature_names=["comt_BOW_tw_matrix","comt_BOW_mentionOnce_matrix","comt_BOW_mentionAllPron_matrix","comt_BOW_rmTopic_matrix", "comt_W2V_tw_vector"]
comt_com_Best_feature_list = [comt_BOW_tw_matrix,comt_BOW_mentionOnce_matrix,comt_BOW_mentionAllPron_matrix, comt_BOW_rmTopic_matrix,comt_W2V_tw_vector]
comt_com_Best_feature_dict = {}
for i in range(len(comt_com_Best_feature_names)):
    comt_com_Best_feature_dict[comt_com_Best_feature_names[i]] = comt_com_Best_feature_names[i]
   
comt_com_Best_feature = comt_com_Best_feature_list[0]
for feature in comt_com_Best_feature_list[1:]:
    comt_com_Best_feature = np.hstack((comt_com_Best_feature,feature))

print("precision:%.3f" % np.mean(cross_val_score(sup_lr, comt_com_Best_feature, comt_ecolabel_list,cv=mycv,scoring='precision')))
print("recall:%.3f" % np.mean(cross_val_score(sup_lr, comt_com_Best_feature, comt_ecolabel_list,cv=mycv,scoring='recall')))
print("f1:%.3f" % np.mean(cross_val_score(sup_lr, comt_com_Best_feature, comt_ecolabel_list,cv=mycv,scoring='f1')))

precision:0.773
recall:0.677
f1:0.721


**[brands-eco]** section 4.2: Evaluate different classifiers

In [145]:
lr = LogisticRegression(multi_class ='ovr',penalty='l2',class_weight="balanced")
gnb = GaussianNB()
rf = RandomForestClassifier()
nn = MLPClassifier(solver='lbfgs',alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)

In [147]:
Cause.eva_classifier(com_Best_feature,comt_ecolabel_list,mycv,score_func='f1',classifier_list = [lr,gnb,rf,nn])

0.709	LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
0.602	GaussianNB(priors=None)
0.545	RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
0.708	MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(15,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5

**[brands-eco]** section 4.3: Analyze terms that have high coefficients

In [148]:
comt_BOW_lr = LogisticRegression(penalty="l2",class_weight="balanced")
comt_BOW_lr.fit(comt_BOW_mentionAllPron_matrix, comt_ecolabel_list)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [149]:
print("Top 20 positive coefficient words:")
for i in np.argsort(comt_BOW_lr.coef_[0])[::-1][:20]:
    print('%20s\t%.3f' % (comt_mentionAllpron_vectorizer.get_feature_names()[i], comt_BOW_lr.coef_[0][i]))

Top 20 positive coefficient words:
       first__person	1.166
               we're	1.122
           committed	0.971
             protect	0.849
                work	0.719
             restore	0.717
           _self_our	0.698
            _NUMBER_	0.683
             greener	0.676
           encourage	0.636
                palm	0.628
      sustainability	0.627
 _self_first__person	0.624
                 oil	0.560
            _self_rt	0.557
              member	0.542
       environmental	0.541
               apple	0.528
                 new	0.526
                harm	0.510


In [150]:
print("Top 20 negative coefficient words:")
for i in np.argsort(comt_BOW_lr.coef_[0])[::1][:20]:
    print('%20s\t%.3f' % (comt_mentionAllpron_vectorizer.get_feature_names()[i], comt_BOW_lr.coef_[0][i]))

Top 20 negative coefficient words:
              planet	-0.978
                  rt	-0.906
              plants	-0.731
       third__person	-0.729
              _self_	-0.712
              oceans	-0.695
                 day	-0.692
      second__person	-0.690
               great	-0.687
                life	-0.649
               today	-0.623
              carbon	-0.573
                   3	-0.572
         responsible	-0.568
             climate	-0.534
          pesticides	-0.532
                home	-0.496
            reducing	-0.479
               ideas	-0.479
             natural	-0.466


**[brands-eco]** section 5: Apply pre-trained classifiers to predict for unseen tweets

**[brands-eco]** section 5.1: Apply support classifier to classify all brands' tweets into support and non-support classes

In [27]:
sup_lr.fit(sup_com_Best_feature,sup_ecolabel_list)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [28]:
Cause.ecobrand_predict_label_0_1(sup_lr,ecobrand_nameid,eco_terms,GN_model,eco_keywords,sup_cont_vectorizer,
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_score_test.txt",
                             "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_23_test.txt")

Note: this code do prediction for each brand, it takes about 8~10 hours to run for each dataset
processing 1 brand: oldelpaso
processing 2 brand: roman_meal
processing 3 brand: seagate
finished!


**[brands-eco]** section 5.2: Apply commitment classifier to classify all brands' support tweets into high- and low- commitment classes

In [29]:
comt_lr.fit(comt_com_Best_feature,comt_ecolabel_list)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [30]:
Cause.ecobrand_predict_label_2_3(comt_lr,GN_model,ecobrand_nameid,eco_terms,eco_keywords,comt_bow_vectorizer,comt_mention1_vectorizer,comt_mentionAllpron_vectorizer,comt_rmTopic_vectorizer,
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_23_test.txt",
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_2_3_test.txt")         


**[brands-eco]** section 6: aggregate each entity's cause-commitment tweets and compare with action score to find inauthentic entities

In [160]:
entity_pred_info, entity_pred2_tw, entity_pred3_tw = Cause.get_aggregate_info("/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_predict_proba_01_2_3.txt",
                                                         sim_limit=0.3,prob_limit=0.7)

Get data for 922 entities


In [161]:
remain_entity_predicts, remove_entity = Cause.filt_entity(entity_pred_info, healthbrand_score_dict, ntw_threshold=0)

127 entities remain after remove entities (number of tweets<0)


**[brands-eco]** section 6.1: Apply different aggregation methods to select entities that have high word-ratings

In [162]:
entity_n3,entity_frac3,entity_prob3,words_topn_entities = Cause.aggregation(remain_entity_predicts,topn=120)

**[brands-eco]** section 6.2: Sort high word-rating entities by their action-rating and select top-n (high word-rating but low action-rating) as inauthentic entities

In [168]:
inauthentic_entities = Cause.inauthentic(words_topn_entities,ecobrand_score_dict,n=10)
print("entity\taction_score\tn_label3\tfrac_label3\tprob_label3\n")
for entity in inauthentic_entities:
    print("%s\t%s\t%d\t%f\t%f" % (entity,ecobrand_score_dict[entity],int(entity_n3[entity]),float(entity_frac3[entity]),
                                  float(entity_prob3[entity])))

entity	action_score	n_label3	frac_label3	prob_label3

lovemyphilly		0	0.000000	0.000000
pomega5		1	0.012048	0.977000
thefrownies		0	0.000000	0.000000
miraclewhip		0	0.000000	0.000000
mdmoms		2	0.045455	0.918500
eatwholly		0	0.000000	0.000000
gourmet	3.2	1	0.016129	0.717000
manentailbeauty	3.2	0	0.000000	0.000000
butterlondon	3.2	1	0.166667	0.911000
bronnerbros	3.2	0	0.000000	0.000000


#### Dataset 3: member of Congress (MOC for short) with eco cause

**[MOC-eco]** section 1: select cause-relevant tweets as training data

In [7]:
MOC_nameid_dict, MOC_rating, MOC_party_dict, MOC_state_dict, MOC_tweets_dict = Cause.read_moc_tweets(CONGRESS_PATH)

Read tweets for 100 congress members
Read tweets for 200 congress members
Read tweets for 300 congress members
Read tweets for 400 congress members
Reapeted congress member: AustinScottGA08
Reapeted congress member: RepDannyDavis
Reapeted congress member: RepMaloney
Reapeted congress member: RepAlGreen
Reapeted congress member: RepEBJ
Read tweets for 500 congress members
Reapeted congress member: LorettaSanchez
Collected 1118962 tweets for 514 congress members in total.


In [32]:
MOC_twID_dict, MOC_twID_twtext = Cause.dedup_tweets(MOC_tweets_dict,cause="eco")

processed 100 entities
processed 200 entities
processed 300 entities
processed 400 entities
processed 500 entities
1096604 non-duplicate tweets for 514 eco brands


In [None]:
MOC_twID_twScore = Cause.score_tweet_by_relevance(MOC_twID_twtext,GN_model,eco_keywords)

In [None]:
MOC_twIDScore = Cause.sort_tweet_by_score(MOC_twID_twScore,MOC_twID_dict)

In [None]:
#Cause.select_topn_tweets(filename,MOC_twIDScore,MOC_twID_twtext,topn=1)

**[MOC-eco]** section 2 & 3: feature enginerring & train and evaluate support classifier

> Labeled data for support classification

In [49]:
sup_moc_list, sup_moctweet_list, sup_moclabel_list = Cause.data_for_sup_clf(moc_eco_path, entity='eco-moc')

Read 379 positive instances and 133 negative instances for support classification


In [50]:
sup_moc_neg_terms, sup_moc_pos_terms = Cause.get_freq_terms(sup_moctweet_list,sup_moclabel_list)

In [51]:
print("Most common terms in negative class (non-support):")
sup_moc_neg_terms.most_common(20)

Most common terms in negative class (non-support):


[('_URL_', 87),
 ('rt', 18),
 ('amp', 16),
 ('economy', 15),
 ('jobs', 14),
 ('energy', 13),
 ('_NUMBER_', 12),
 ('water', 10),
 ('climate', 9),
 ('communities', 9),
 ('economic', 9),
 ('epa', 8),
 ('lives', 8),
 ('forest', 7),
 ('need', 7),
 ('industry', 7),
 ('world', 7),
 ('environment', 7),
 ('families', 7),
 ('nation', 7)]

In [52]:
print("Most common terms in positive class (support):")
sup_moc_pos_terms.most_common(20)

Most common terms in positive class (support):


[('_URL_', 255),
 ('amp', 92),
 ('protect', 57),
 ('climate', 51),
 ('wildlife', 49),
 ('species', 49),
 ('rt', 41),
 ('conservation', 37),
 ('change', 36),
 ('environment', 36),
 ('_NUMBER_', 33),
 ('water', 33),
 ('pollution', 31),
 ('health', 28),
 ('forests', 27),
 ('protecting', 26),
 ('endangered', 26),
 ('habitat', 25),
 ('future', 25),
 ('today', 24)]

**[MOC-eco]** section 3.1: Evaluating linguistic features, word embedding features and combination of various features

In [53]:
sup_lr = LogisticRegression(solver = 'lbfgs',multi_class ='ovr')

In [54]:
#bag-of-words
sup_BOW_tweet = sup_moctweet_list
sup_bow_vectorizer,sup_BOW_tw_matrix = Cause.construct_feature_matrix(sup_BOW_tweet)

#bag-of-words + polarity
sup_BOW_addpola = Cause.mark_polarity(sup_moctweet_list,to_wd=0)
sup_bowneg_vectorizer,sup_BOW_neg_matrix = Cause.construct_feature_matrix(sup_BOW_addpola)

#bag-of-words + pronoun
sup_BOW_addPron_tweet = Cause.mark_pronouns(sup_moctweet_list, binary = False)
sup_pron_vectorizer,sup_BOW_pron_matrix = Cause.construct_feature_matrix(sup_BOW_addPron_tweet)

#bag-of-words + keywords' context
sup_BOW_addCont_tweet = Cause.mark_context(sup_moctweet_list,eco_terms)
sup_cont_vectorizer,sup_BOW_cont_matrix = Cause.construct_feature_matrix(sup_BOW_addCont_tweet)

#bag-of-words + remove cause keywords
sup_BOW_rmTopic_tweet = Cause.remove_keywords(sup_moctweet_list, eco_terms)
sup_rmTopic_vectorizer,sup_BOW_rmTopic_matrix = Cause.construct_feature_matrix(sup_BOW_rmTopic_tweet)

#bag-of-words + self_mention
sup_BOW_self_tweet_once = Cause.selfmention(sup_moc_list,MOC_nameid_dict,sup_moctweet_list, count="once")
sup_mention1_vectorizer,sup_BOW_mentionOnce_matrix = Cause.construct_feature_matrix(sup_BOW_self_tweet_once)

#bag-of-words + self_mention to all words
sup_BOW_self_tweet_all = Cause.selfmention(sup_moc_list,MOC_nameid_dict,sup_moctweet_list, count="all")
sup_mentionAll_vectorizer,sup_BOW_mentionAll_matrix = Cause.construct_feature_matrix(sup_BOW_self_tweet_all)

#bag-of-words + self_mention + pronoun
sup_BOW_self_pron_tweet = Cause.selfmention(sup_moc_list,MOC_nameid_dict,sup_BOW_addPron_tweet, count="once")
sup_mention1pron_vectorizer,sup_BOW_mentionOncePron_matrix = Cause.construct_feature_matrix(sup_BOW_self_pron_tweet)

#bag-of-words + self_mention to all words + pronoun
sup_BOW_selfall_pron_tweet = Cause.selfmention(sup_moc_list,MOC_nameid_dict,sup_BOW_addPron_tweet, count="all")
sup_mentionAllpron_vectorizer,sup_BOW_mentionAllPron_matrix = Cause.construct_feature_matrix(sup_BOW_selfall_pron_tweet)

In [55]:
sup_Lingu_feature_names = ["sup_BOW_tw_matrix","sup_BOW_neg_matrix","sup_BOW_pron_matrix","sup_BOW_cont_matrix", 
                           "sup_BOW_rmTopic_matrix","sup_BOW_mentionOnce_matrix","sup_BOW_mentionAll_matrix", 
                           "sup_BOW_mentionOncePron_matrix", "sup_BOW_mentionAllPron_matrix"]

sup_Lingu_features = [sup_BOW_tw_matrix, sup_BOW_neg_matrix, sup_BOW_pron_matrix, sup_BOW_cont_matrix, sup_BOW_rmTopic_matrix,
        sup_BOW_mentionOnce_matrix, sup_BOW_mentionAll_matrix, sup_BOW_mentionOncePron_matrix, sup_BOW_mentionAllPron_matrix]
sup_Lingu_feature_dict = {}
for i in range(len(sup_Lingu_feature_names)):
    sup_Lingu_feature_dict[sup_Lingu_feature_names[i]] = sup_Lingu_features[i]

In [56]:
Cause.eva_bow_feature(sup_Lingu_feature_dict,sup_moclabel_list,sup_lr,mycv,score_func='f1')            

sup_BOW_mentionAll_matrix	0.873
sup_BOW_mentionOnce_matrix	0.873
sup_BOW_tw_matrix	0.873
sup_BOW_cont_matrix	0.872
sup_BOW_neg_matrix	0.871
sup_BOW_mentionAllPron_matrix	0.871
sup_BOW_rmTopic_matrix	0.869
sup_BOW_mentionOncePron_matrix	0.868
sup_BOW_pron_matrix	0.868


In [57]:
sup_W2V_tw_score = Cause.calculate_tw_w2v_score(sup_moctweet_list,GN_model,eco_keywords)
sup_W2V_tw_vector = Cause.construct_tw_vector(sup_moctweet_list,GN_model)

sup_tweet_rankedwd_list = Cause.rank_match_words(sup_moctweet_list, GN_model, eco_keywords)
sup_W2V_topnwd, sup_W2V_topnwd_scores = Cause.get_topn_words(sup_tweet_rankedwd_list,n=3)
sup_topn_vectorizer,sup_W2V_topnwd_matrix = Cause.construct_feature_matrix(sup_W2V_topnwd)

sup_W2V_topnwd_vectors = Cause.get_topn_vectors(sup_tweet_rankedwd_list,GN_model,n=3)

sup_W2V_topicwd_list,sup_tweet_topicwd_tp_list = Cause.get_topic_words(sup_moctweet_list,GN_model,eco_keywords,threshold = 0.30)
sup_topic_vectorizer,sup_Topicwd_matrix = Cause.construct_feature_matrix(sup_W2V_topicwd_list)

sup_W2V_topicwd_ct,sup_W2V_topicwd_score,sup_W2V_topicwd_leftcontri,sup_W2V_topicwd_rightcontri = Cause.sep_topic_features(sup_tweet_topicwd_tp_list)

sup_W2V_topicwd_sum = Cause.get_topicwd_score_sum(sup_W2V_topicwd_score)

sup_W2V_contri_score = Cause.get_contri_sum(sup_tweet_topicwd_tp_list)

sup_W2V_topicwd_vectors = Cause.get_topicwd_vec(GN_model,sup_tweet_topicwd_tp_list)

In [42]:
sup_W2V_feature_names = ["sup_W2V_tw_score", "sup_W2V_tw_vector", "sup_W2V_topnwd_matrix", "sup_W2V_topnwd_scores", 
                     "sup_W2V_topnwd_vectors", "sup_Topicwd_matrix", "sup_W2V_topicwd_ct", "sup_W2V_topicwd_sum", 
                     "sup_W2V_contri_score", "sup_W2V_topicwd_vectors"]

sup_W2V_features = [sup_W2V_tw_score, sup_W2V_tw_vector, sup_W2V_topnwd_matrix, sup_W2V_topnwd_scores, sup_W2V_topnwd_vectors, 
                sup_Topicwd_matrix, sup_W2V_topicwd_ct, sup_W2V_topicwd_sum, sup_W2V_contri_score, sup_W2V_topicwd_vectors]
sup_W2V_feature_dict = {}
for i in range(len(sup_W2V_feature_names)):
    sup_W2V_feature_dict[sup_W2V_feature_names[i]] = sup_W2V_features[i]

In [43]:
Cause.eva_w2v_feature(sup_W2V_feature_dict,sup_moclabel_list,sup_lr,mycv,score_func='f1') 

sup_W2V_tw_vector	0.881
sup_W2V_topicwd_vectors	0.878
sup_W2V_topnwd_vectors	0.875
sup_W2V_topnwd_scores	0.871
sup_W2V_topnwd_matrix	0.864
sup_Topicwd_matrix	0.863
sup_W2V_topicwd_sum	0.862
sup_W2V_contri_score	0.857
sup_W2V_tw_score	0.851
sup_W2V_topicwd_ct	0.849


In [44]:
sup_COM_feature_names = ["sup_BOW_tw_matrix","sup_BOW_neg_matrix","sup_BOW_mentionOnce_matrix","sup_BOW_cont_matrix",
                     "sup_W2V_tw_vector", "sup_W2V_topnwd_matrix", "sup_W2V_topnwd_vectors", "sup_W2V_topicwd_vectors"]

sup_COM_features = [sup_BOW_tw_matrix,sup_BOW_neg_matrix,sup_BOW_mentionOnce_matrix,sup_BOW_cont_matrix,
                sup_W2V_tw_vector, sup_W2V_topnwd_matrix, sup_W2V_topnwd_vectors, sup_W2V_topicwd_vectors]
sup_COM_feature_dict = {}
for i in range(len(sup_COM_feature_names)):
    sup_COM_feature_dict[sup_COM_feature_names[i]] = sup_COM_features[i]

In [45]:
Cause.eva_comb_feature(sup_COM_feature_names,sup_COM_feature_dict,sup_moclabel_list,sup_lr,mycv,score_func='f1')   

Combine 1 features.
Combine 2 features.
Combine 3 features.
Combine 4 features.
Combine 5 features.
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.915
sup_BOW_cont_matrix + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.914
sup_BOW_mentionOnce_matrix + sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.912
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_matrix + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.911
sup_BOW_neg_matrix + sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.910
sup_BOW_tw_matrix + sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.910
sup_BOW_cont_matrix + sup_W2V_topnwd_vectors	0.909
sup_BOW_cont_matrix + sup_W2V_tw_vector + sup_W2V_topnwd_vectors	0.908
sup_BOW_mentionOnce_matrix + sup_BOW_cont_matrix + sup_W2V_topnwd_vectors + sup_W2V_topicwd_vectors	0.907
sup_BOW_cont_mat

In [58]:
sup_com_Best_feature_names=["sup_BOW_cont_matrix","sup_W2V_tw_vector", "sup_W2V_topnwd_vectors","sup_W2V_topicwd_vectors"]
sup_com_Best_feature_list = [sup_BOW_cont_matrix,sup_W2V_tw_vector, sup_W2V_topnwd_vectors,sup_W2V_topicwd_vectors]
sup_com_Best_feature_dict = {}
for i in range(len(sup_com_Best_feature_names)):
    sup_com_Best_feature_dict[sup_com_Best_feature_names[i]] = sup_com_Best_feature_names[i]
   
sup_com_Best_feature = sup_com_Best_feature_list[0]
for feature in sup_com_Best_feature_list[1:]:
    sup_com_Best_feature = np.hstack((sup_com_Best_feature,feature))

print("precision:%.3f" % np.mean(cross_val_score(sup_lr, sup_com_Best_feature, sup_moclabel_list,cv=mycv,scoring='precision')))
print("recall:%.3f" % np.mean(cross_val_score(sup_lr, sup_com_Best_feature, sup_moclabel_list,cv=mycv,scoring='recall')))
print("f1:%.3f" % np.mean(cross_val_score(sup_lr, sup_com_Best_feature, sup_moclabel_list,cv=mycv,scoring='f1')))

precision:0.879
recall:0.952
f1:0.913


**[MOC-eco]** section 2 & 4: feature enginerring & train and evaluate commitment classifier

> Labeled data for commitment classification

In [18]:
comt_moc_list, comt_moctweet_list, comt_moclabel_list = Cause.data_for_commit_clf(moc_eco_path,entity='eco-moc')

Read 140 positive instances and 239 negative instances for commitment classification


In [19]:
comt_moc_neg_terms, comt_moc_pos_terms = Cause.get_freq_terms(comt_moctweet_list,comt_moclabel_list)

In [20]:
print("Most common terms in negative class (low-commitment):")
comt_moc_neg_terms.most_common(20)

Most common terms in negative class (low-commitment):


[('_URL_', 154),
 ('amp', 58),
 ('species', 33),
 ('climate', 32),
 ('protect', 30),
 ('wildlife', 28),
 ('pollution', 25),
 ('_NUMBER_', 23),
 ('water', 23),
 ('environment', 22),
 ('change', 22),
 ('rt', 21),
 ('epa', 20),
 ('conservation', 19),
 ('forests', 18),
 ('endangered', 18),
 ('health', 18),
 ('carbon', 17),
 ('communities', 17),
 ('economy', 16)]

In [21]:
print("Most common terms in negative class (high-commitment):")
comt_moc_neg_terms.most_common(20)

Most common terms in negative class (high-commitment):


[('_URL_', 154),
 ('amp', 58),
 ('species', 33),
 ('climate', 32),
 ('protect', 30),
 ('wildlife', 28),
 ('pollution', 25),
 ('_NUMBER_', 23),
 ('water', 23),
 ('environment', 22),
 ('change', 22),
 ('rt', 21),
 ('epa', 20),
 ('conservation', 19),
 ('forests', 18),
 ('endangered', 18),
 ('health', 18),
 ('carbon', 17),
 ('communities', 17),
 ('economy', 16)]

**[moc-eco]** section 4.1: Evaluating linguistic features, word embedding features and combination of various features

In [22]:
comt_lr = LogisticRegression(solver = 'lbfgs',multi_class ='ovr')

In [23]:
#bag-of-words
comt_BOW_tweet = comt_moctweet_list
comt_bow_vectorizer,comt_BOW_tw_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_tweet)

#bag-of-words + polarity
comt_BOW_addpola = Cause.mark_polarity(comt_moctweet_list,to_wd=0)
comt_bowneg_vectorizer,comt_BOW_neg_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_addpola)

#bag-of-words + pronoun
comt_BOW_addPron_tweet = Cause.mark_pronouns(comt_moctweet_list, binary = False)
comt_pron_vectorizer,comt_BOW_pron_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_addPron_tweet)

#bag-of-words + keywords' context
comt_BOW_addCont_tweet = Cause.mark_context(comt_moctweet_list,eco_terms)
comt_cont_vectorizer,comt_BOW_cont_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_addCont_tweet)

#bag-of-words + remove cause keywords
comt_BOW_rmTopic_tweet = Cause.remove_keywords(comt_moctweet_list, eco_terms)
comt_rmTopic_vectorizer,comt_BOW_rmTopic_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_rmTopic_tweet)

#bag-of-words + self_mention
comt_BOW_self_tweet_once = Cause.selfmention(comt_moc_list,MOC_nameid_dict,comt_moctweet_list, count="once")
comt_mention1_vectorizer,comt_BOW_mentionOnce_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_self_tweet_once)

#bag-of-words + self_mention to all words
comt_BOW_self_tweet_all = Cause.selfmention(comt_moc_list,MOC_nameid_dict,comt_moctweet_list, count="all")
comt_mentionAll_vectorizer,comt_BOW_mentionAll_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_self_tweet_all)

#bag-of-words + self_mention + pronoun
comt_BOW_self_pron_tweet = Cause.selfmention(comt_moc_list,MOC_nameid_dict,comt_BOW_addPron_tweet, count="once")
comt_mention1pron_vectorizer,comt_BOW_mentionOncePron_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_self_pron_tweet)

#bag-of-words + self_mention to all words + pronoun
comt_BOW_selfall_pron_tweet = Cause.selfmention(comt_moc_list,MOC_nameid_dict,comt_BOW_addPron_tweet, count="all")
comt_mentionAllpron_vectorizer,comt_BOW_mentionAllPron_matrix = Cause.construct_feature_matrix_formoc(comt_BOW_selfall_pron_tweet)

In [24]:
comt_Lingu_feature_names = ["comt_BOW_tw_matrix","comt_BOW_neg_matrix","comt_BOW_pron_matrix","comt_BOW_cont_matrix", 
                           "comt_BOW_rmTopic_matrix","comt_BOW_mentionOnce_matrix","comt_BOW_mentionAll_matrix", 
                           "comt_BOW_mentionOncePron_matrix", "comt_BOW_mentionAllPron_matrix"]

comt_Lingu_features = [comt_BOW_tw_matrix, comt_BOW_neg_matrix, comt_BOW_pron_matrix, comt_BOW_cont_matrix, comt_BOW_rmTopic_matrix,
        comt_BOW_mentionOnce_matrix, comt_BOW_mentionAll_matrix, comt_BOW_mentionOncePron_matrix, comt_BOW_mentionAllPron_matrix]
comt_Lingu_feature_dict = {}
for i in range(len(comt_Lingu_feature_names)):
    comt_Lingu_feature_dict[comt_Lingu_feature_names[i]] = comt_Lingu_features[i]

In [25]:
Cause.eva_bow_feature(comt_Lingu_feature_dict,comt_moclabel_list,sup_lr,mycv,score_func='f1')            

comt_BOW_mentionAll_matrix	0.651
comt_BOW_mentionAllPron_matrix	0.646
comt_BOW_cont_matrix	0.630
comt_BOW_mentionOnce_matrix	0.628
comt_BOW_mentionOncePron_matrix	0.628
comt_BOW_rmTopic_matrix	0.616
comt_BOW_pron_matrix	0.587
comt_BOW_neg_matrix	0.587
comt_BOW_tw_matrix	0.587


In [36]:
comt_W2V_tw_score = Cause.calculate_tw_w2v_score(comt_moctweet_list,GN_model,eco_keywords)
comt_W2V_tw_vector = Cause.construct_tw_vector(comt_moctweet_list,GN_model)

comt_tweet_rankedwd_list = Cause.rank_match_words(comt_moctweet_list, GN_model, eco_keywords)
comt_W2V_topnwd, comt_W2V_topnwd_scores = Cause.get_topn_words(comt_tweet_rankedwd_list,n=3)
comt_topn_vectorizer,comt_W2V_topnwd_matrix = Cause.construct_feature_matrix_formoc(comt_W2V_topnwd)

comt_W2V_topnwd_vectors = Cause.get_topn_vectors(comt_tweet_rankedwd_list,GN_model,n=3)

comt_W2V_topicwd_list,comt_tweet_topicwd_tp_list = Cause.get_topic_words(comt_moctweet_list,GN_model,eco_keywords,threshold = 0.30)
comt_topic_vectorizer,comt_Topicwd_matrix = Cause.construct_feature_matrix_formoc(comt_W2V_topicwd_list)

comt_W2V_topicwd_ct,comt_W2V_topicwd_score,comt_W2V_topicwd_leftcontri,comt_W2V_topicwd_rightcontri = Cause.sep_topic_features(comt_tweet_topicwd_tp_list)

comt_W2V_topicwd_sum = Cause.get_topicwd_score_sum(comt_W2V_topicwd_score)

comt_W2V_contri_score = Cause.get_contri_sum(comt_tweet_topicwd_tp_list)

comt_W2V_topicwd_vectors = Cause.get_topicwd_vec(GN_model,comt_tweet_topicwd_tp_list)

In [37]:
comt_W2V_feature_names = ["comt_W2V_tw_vector", "comt_W2V_topnwd_matrix",  "comt_W2V_topnwd_vectors", "comt_Topicwd_matrix", 
                      "comt_W2V_topicwd_vectors"]

comt_W2V_features = [comt_W2V_tw_vector, comt_W2V_topnwd_matrix, comt_W2V_topnwd_vectors, 
                comt_Topicwd_matrix, comt_W2V_topicwd_vectors]
comt_W2V_feature_dict = {}
for i in range(len(comt_W2V_feature_names)):
    comt_W2V_feature_dict[comt_W2V_feature_names[i]] = comt_W2V_features[i]

In [38]:
Cause.eva_w2v_feature(comt_W2V_feature_dict,comt_moclabel_list,comt_lr,mycv,score_func='f1') 

comt_W2V_tw_vector	0.448
comt_Topicwd_matrix	0.292
comt_W2V_topnwd_vectors	0.284
comt_W2V_topnwd_matrix	0.274
comt_W2V_topicwd_vectors	0.264


In [39]:
comt_COM_feature_names = ["comt_BOW_tw_matrix","comt_BOW_mentionOnce_matrix","comt_BOW_mentionAllPron_matrix","comt_BOW_rmTopic_matrix",
                     "comt_W2V_tw_vector", "comt_W2V_topnwd_matrix", "comt_W2V_topnwd_vectors", "comt_W2V_topicwd_vectors"]

comt_COM_features = [comt_BOW_tw_matrix,comt_BOW_mentionOnce_matrix,comt_BOW_mentionAllPron_matrix,comt_BOW_rmTopic_matrix,
                comt_W2V_tw_vector, comt_W2V_topnwd_matrix, comt_W2V_topnwd_vectors, comt_W2V_topicwd_vectors]
comt_COM_feature_dict = {}
for i in range(len(comt_COM_feature_names)):
    comt_COM_feature_dict[comt_COM_feature_names[i]] = comt_COM_features[i]

In [41]:
Cause.eva_comb_feature(comt_COM_feature_names,comt_COM_feature_dict,comt_moclabel_list,comt_lr,mycv,score_func='f1')   

Combine 1 features.
Combine 2 features.
Combine 3 features.
Combine 4 features.
Combine 5 features.
comt_BOW_tw_matrix + comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_topnwd_matrix + comt_W2V_topnwd_vectors	0.678
comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix	0.675
comt_BOW_tw_matrix + comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix	0.675
comt_BOW_tw_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_matrix + comt_W2V_topnwd_vectors	0.675
comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_matrix + comt_W2V_topnwd_vectors	0.675
comt_BOW_tw_matrix + comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_vectors	0.675
comt_BOW_mentionOnce_matrix + comt_BOW_mentionAllPron_matrix + comt_W2V_tw_vector + comt_W2V_topnwd_vectors	0.673
comt_BOW_mentionAllPron_matrix + comt_BOW_rmTopic_matrix + comt_W2

In [45]:
comt_com_Best_feature_names=["comt_BOW_tw_matrix","comt_BOW_mentionAllPron_matrix", "comt_W2V_topnwd_vectors"]
comt_com_Best_feature_list = [comt_BOW_tw_matrix, comt_BOW_mentionAllPron_matrix, comt_W2V_topnwd_vectors]
comt_com_Best_feature_dict = {}
for i in range(len(comt_com_Best_feature_names)):
    comt_com_Best_feature_dict[comt_com_Best_feature_names[i]] = comt_com_Best_feature_names[i]
   
comt_com_Best_feature = comt_com_Best_feature_list[0]
for feature in comt_com_Best_feature_list[1:]:
    comt_com_Best_feature = np.hstack((comt_com_Best_feature,feature))

print("precision:%.3f" % np.mean(cross_val_score(sup_lr, comt_com_Best_feature, comt_moclabel_list,cv=mycv,scoring='precision')))
print("recall:%.3f" % np.mean(cross_val_score(sup_lr, comt_com_Best_feature, comt_moclabel_list,cv=mycv,scoring='recall')))
print("f1:%.3f" % np.mean(cross_val_score(sup_lr, comt_com_Best_feature, comt_moclabel_list,cv=mycv,scoring='f1')))

precision:0.731
recall:0.626
f1:0.670


**[MOC-eco]** section 4.2: Evaluate different classifiers

In [46]:
lr = LogisticRegression(multi_class ='ovr',penalty='l2',class_weight="balanced")
gnb = GaussianNB()
rf = RandomForestClassifier()
nn = MLPClassifier(solver='lbfgs',alpha=1e-5, hidden_layer_sizes=(15,), random_state=1)

In [60]:
Cause.eva_classifier(sup_com_Best_feature,sup_moclabel_list,mycv,score_func='f1',classifier_list = [lr,gnb,rf,nn])

0.908	LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
0.862	GaussianNB(priors=None)
0.855	RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)
0.902	MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(15,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5

**[MOC-eco]** section 4.3: Analyze terms that have high coefficients

In [62]:
comt_BOW_lr = LogisticRegression(penalty="l2",class_weight="balanced")
comt_BOW_lr.fit(comt_BOW_mentionAllPron_matrix, comt_moclabel_list)

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [63]:
print("Top 20 positive coefficient words:")
for i in np.argsort(comt_BOW_lr.coef_[0])[::-1][:20]:
    print('%20s\t%.3f' % (comt_mentionAllpron_vectorizer.get_feature_names()[i], comt_BOW_lr.coef_[0][i]))

Top 20 positive coefficient words:
                   i	2.490
                  my	1.837
                  me	1.453
                must	1.214
             hearing	1.124
                 day	1.087
             discuss	1.083
                part	1.082
           _self_for	1.004
                bill	0.980
  _MENTION_defenders	0.943
               about	0.919
               local	0.899
               let's	0.880
             working	0.878
                   w	0.851
               areas	0.801
             coastal	0.790
         legislation	0.742
_HASHTAG_conservation	0.741


In [64]:
print("Top 20 negative coefficient words:")
for i in np.argsort(comt_BOW_lr.coef_[0])[::1][:20]:
    print('%20s\t%.3f' % (comt_mentionAllpron_vectorizer.get_feature_names()[i], comt_BOW_lr.coef_[0][i]))

Top 20 negative coefficient words:
                 see	-0.952
                  rt	-0.942
              global	-0.830
                 are	-0.786
           pollution	-0.776
            reminder	-0.737
                come	-0.711
            historic	-0.703
                  if	-0.683
               clean	-0.649
                  is	-0.649
                more	-0.644
       _self_species	-0.608
                 oil	-0.605
                 new	-0.582
                 who	-0.576
       _self_climate	-0.570
                 you	-0.569
             warming	-0.567
                time	-0.559


**[MOC-eco]** section 5: Apply pre-trained classifiers to predict for unseen tweets

**[MOC-eco]** section 5.1: Apply support classifier to classify all brands' tweets into support and non-support classes

In [65]:
sup_lr.fit(sup_com_Best_feature,sup_moclabel_list)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [None]:
Cause.ecobrand_predict_label_0_1(sup_lr,ecobrand_nameid,eco_terms,GN_model,eco_keywords,sup_cont_vectorizer,
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_score_test.txt",
                             "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_23_test.txt")

**[MOC-eco]** section 5.2: Apply commitment classifier to classify all brands' support tweets into high- and low- commitment classes

In [71]:
comt_lr.fit(comt_com_Best_feature,comt_moclabel_list)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='lbfgs', tol=0.0001,
          verbose=0, warm_start=False)

In [None]:
Cause.ecobrand_predict_label_2_3(comt_lr,GN_model,ecobrand_nameid,eco_terms,eco_keywords,comt_bow_vectorizer,comt_mention1_vectorizer,comt_mentionAllpron_vectorizer,comt_rmTopic_vectorizer,
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_23_test.txt",
                              "/data/2/zwang/2017_S/Tweet_Eco/966brand_tweet_01_2_3_test.txt")         

**[MOC-eco]** section 6: aggregate each entity's cause-commitment tweets and compare with action score to find inauthentic entities

In [69]:
entity_pred_info, entity_pred2_tw, entity_pred3_tw = Cause.get_aggregate_info("/data/2/zwang/2017_S/Congress_Eco/514moc_tweet_predict_proba_01_2_3_re2.txt",
                                                         sim_limit=0.3,prob_limit=0.7)

Get data for 510 entities


In [72]:
remain_entity_predicts, remove_entity = Cause.filt_entity(entity_pred_info, MOC_score_dict, ntw_threshold=0)

NameError: name 'MOC_score_dict' is not defined

**[MOC-eco]** section 6.1: Apply different aggregation methods to select entities that have high word-ratings

In [None]:
entity_n3,entity_frac3,entity_prob3,words_topn_entities = Cause.aggregation(remain_entity_predicts,topn=120)

**[MOC-eco]** section 6.2: Sort high word-rating entities by their action-rating and select top-n (high word-rating but low action-rating) as inauthentic entities

In [None]:
inauthentic_entities = Cause.inauthentic(words_topn_entities,moc_score_dict,n=10)

In [None]:
print("entity\taction_score\tn_label3\tfrac_label3\tprob_label3\n")
for entity in inauthentic_entities:
    print("%s\t%s\t%d\t%f\t%f" % (entity,ecobrand_score_dict[entity],int(entity_n3[entity]),float(entity_frac3[entity]),
                                  float(entity_prob3[entity])))

**[For all 3 datasets]** section 7. Fit linear regression model to analyze how does entities' word commitment level relate with action-ratings