My work with Russian Troll Tweets is divided into 3 parts because of Kaggle resource restrictions. Here are the links to the parts:
#### [Part 1. EDA](https://www.kaggle.com/code/mathemilda/part-i-eda)
#### [Part 2. Feature Engineering](https://www.kaggle.com/code/mathemilda/part-2-feature-engineering)
#### [Part 3. Machine Learning with accuracy 99.6%](https://www.kaggle.com/code/mathemilda/part-3-machine-learning-with-accuracy-99-6) (this one)

# Outline for Part 3.
## 1. Download, initial cleaning and concatenation of data sets
## 2. Feature Engineering for each account
## 3. Feature Selection
## 4. Machine Learning

### My findings for this part
* My plan was to use Deep Learning for this data set. I looked at the work of others and checked the data myself, and found that this approach did not yield good classification results. So I decided to add more features to create weak predictors. My "weak" predictor turned out to be rather strong.
* It turned out that the most prominent properties are the ones related to propaganda methods. Apparently trolls have specific guidelines and they stick to them. I find this convenient, because we can set up filters to catch the most significant phenomena and then check the whole activity of an account.
* In addition, the features most important for prediction turned out to depend not so much on language as on troll account activity. Thus the approach can be applied to other languages and is not limited to Russian trolls posting English texts.

___
*Remark.* For some reason the very useful module `ftfy` is not available on Kaggle, although it was here a couple of years ago. I contacted the Support team about it, and they replied that I should look for help on the Kaggle forum. I installed it here, although as you can see I got a message that I should not do it.

In [None]:
import pandas as pd
import os
import glob
import numpy as np
import re
from string import punctuation, whitespace
import warnings
warnings.filterwarnings("ignore")
!pip install ftfy
import ftfy
from sklearn.utils import shuffle
import gc
import multiprocess as mp
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, \
        precision_score, recall_score, roc_auc_score
from catboost import CatBoostClassifier, Pool
from sklearn.base import BaseEstimator, TransformerMixin

### Data download from Russian Troll Tweets, Sentiment140 and Celebrity tweets.
I discovered the following problems with the Russian Troll Tweets dataset (see https://www.kaggle.com/code/mathemilda/part-i-eda/notebook):
* One tweet is missing.
* There are columns which I do not need.
* There are many tweets in languages other than English, which I removed.
* There are German and Russian texts among the tweets classified as English.
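
One way to catch the Russian texts mentioned in the last item is a regular-expression character class over Cyrillic letters. The snippet below is only a sketch on made-up tweets; the `sample` frame is hypothetical and not part of the datasets. Note that the square brackets matter: without them the pattern is searched as the literal string `А-Яа-я`.

```python
import pandas as pd

# Hypothetical sample; the real data comes from the Russian Troll Tweets files.
sample = pd.DataFrame({'tweet': ['Hello world',
                                 'Привет, мир',
                                 'Good morning America']})

# [А-Яа-яЁё] is a character class: it matches any single Cyrillic letter.
has_cyrillic = sample.tweet.str.contains('[А-Яа-яЁё]', regex=True)

# Keep only the rows without Cyrillic letters.
english_only = sample[~has_cyrillic]
print(english_only.tweet.tolist())
```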

In [None]:
#../input/russian-troll-tweets/IRAhandle_tweets_1.csv
PATH = "../input/russian-troll-tweets/"
filenames = glob.glob(os.path.join(PATH, "*.csv"))
full_ru_trolls = pd.concat((pd.read_csv(f) for f in filenames))
full_ru_trolls.drop(['external_author_id', 'region', 'harvested_date',
        'updates', 'account_type', 'new_june_2018', 'post_type',
        'account_category', 'following', 'followers', 'retweet'],
        axis=1, inplace=True)
full_ru_trolls = full_ru_trolls[full_ru_trolls.content.notnull()]
full_ru_trolls['troll']=1
full_ru_trolls_en = full_ru_trolls[full_ru_trolls.language == 'English'].copy(deep=True)
full_ru_trolls_en.rename(columns={'author': 'account', 'content': 'tweet'}, inplace=True)
# drop tweets containing Cyrillic letters (the character class needs square brackets)
full_ru_trolls_en = full_ru_trolls_en[~full_ru_trolls_en.tweet.str.contains('[А-Яа-яЁё]')]
german_s = re.compile('(Ich )|(Sie )|(Ihnen )|( sich$)|( [Kk]?eine? )|( [Dd]as )|'+
           '^[Dd]as |^[Ss]ind | bist | und | sind | (?!(van|von|-)) der |' + 
           '[ ^][a-z]*ö|[ ^][a-z]*ä|[ ^][a-z]*ü')
full_ru_trolls_en = full_ru_trolls_en[~full_ru_trolls_en.tweet.str.contains(german_s)].copy(deep=True)
del full_ru_trolls
full_ru_trolls_en.drop(['language', 'publish_date'],
        axis=1, inplace=True)

The datasets are not consistent with each other: column names for the same entities differ from one dataset to another. I renamed all account name columns to 'account' and all tweet columns to 'tweet'. I am not going to use the other columns.

Here I load the Sentiment140 and Celebrity tweets datasets that I found on Kaggle.

In [None]:
sentiment140 = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv',\
    encoding = 'Latin-1', names=('target', 'id', 'date', 'flag', 'username','tweet'))
sentiment140.drop(['target', 'id', 'date', 'flag'], axis=1, inplace=True)
sentiment140.rename(columns={'username': 'account'}, inplace=True)
sentiment140['troll']=0

PATH = "../input/RawTwitterFeeds"
filenames = glob.glob(os.path.join(PATH, "*.csv"))
celebs = pd.concat((pd.read_csv(f) for f in filenames))
celebs.drop(['Unnamed: 0', 'Unnamed: 0.1','id', 'date', 'link', 'retweet'], axis=1,inplace=True)
celebs.rename(columns={'author': 'account', 'text': 'tweet'}, inplace=True)
celebs = celebs[celebs.tweet.notnull()]
celebs['troll']=0

Now let us gather everything into one data frame.

In [None]:
used_cols = ['account', 'tweet', 'troll']
total_data = pd.concat([celebs[used_cols], sentiment140[used_cols], full_ru_trolls_en[used_cols]], \
                     ignore_index = True)

The function below preprocesses a tweet, extracts properties and removes some elements to get a more or less clean text. See Part 2 for more detail on why I chose these particular properties.
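
To make the count-then-remove idea concrete, here is a reduced sketch that handles only URLs, hashtags and handles. The helper name `count_and_strip` and the example tweet are made up for illustration; the full function in the next cell does considerably more.

```python
import re

# Simplified patterns in the spirit of the full function; not a replacement for it.
URL = re.compile(r'https?://\S+\b')
HASHTAG = re.compile(r'#\w+\b')
HANDLE = re.compile(r'@\w{1,15}\b')

def count_and_strip(text):
    """Count URLs, hashtags and handles, then remove them from the text."""
    counts = {}
    for name, pattern in (('url_n', URL), ('hasht_n', HASHTAG), ('handle_n', HANDLE)):
        counts[name] = len(pattern.findall(text))
        text = pattern.sub('', text)
    # collapse the whitespace left behind by the removals
    text = re.sub(r'\s+', ' ', text).strip()
    return text, counts

cleaned, counts = count_and_strip(
    'Vote now! #election @some_user https://t.co/abc123 #news')
print(cleaned, counts)
```

URLs are removed first so that a `#fragment` inside a link is not counted as a hashtag.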

In [None]:
dashes = [chr(int(d, 16)) for d in ['058A', '05BE', '1400', '1806', '2010', '2011',\
          '2012', '2013', '2014', '2015', '2053', '207B', '208B', '2212', '2E17', \
          '2E1A', '2E3A', '2E3B', '2E40', '2E5D', '301C', '3030', '30A0', 'FE31', \
          'FE32', 'FE58', 'FE63', 'FF0D', '10EAD']]
dashes_compiled = re.compile('[' + ''.join(dashes) + ']+', flags = re.UNICODE)

def cleaning_and_counts(s):
    s = ftfy.fix_text(s)
    s = re.sub(dashes_compiled, '-', s) # all dashes should be the same
    url_n = len(re.findall('https?://\\S+\\b', s)) # count urls
    s = re.sub('https?://\\S+\\b', '', s) # and remove them
    hasht_n = len(re.findall(r'#\w+\b', s)) # count hashtags
    s = re.sub(r'#\w+\b', '', s) # remove them
    handle_n = len(re.findall(r'@\w{1,15}\b', s)) # count handles
    s = re.sub(r'@\w{1,15}\b', '', s) # remove them
    s = re.sub('pic\\.twitter\\.com/\\w+\\b', '', s) # remove pictures. Counting them was useless
    s = re.sub('\\s+', ' ', s) #reducing multiple whitespaces to one
    s = s.lstrip(whitespace+punctuation+'\xa0'+chr(8230)) # removing possible whitespaces in front
    s = s.rstrip(whitespace+'\xa0') # and on the back
    l=''
    emoj_and_such = 0
    for ch in s:
        if ord(ch) < 8204:
            l += ch # keep a symbol if not emoji or pictogram or such
        else:
            emoj_and_such += 1 # counting emojis and pictograms
    comma_n = len(re.findall(',', s))
    exl_n =  len(re.findall('!', s))
    dash_n = len(re.findall('-', s))
    a_an_n = len(re.findall(r'\b[Aa]n?\b', s))
    the_n = len(re.findall(r'\b[Tt]he\b', s))
    # reduce a number of repeated symbols to no more than 2 
    l = re.sub(r'(.)\1\1+', r'\1\1', l)
    length = len(l)
    words = [len(w) for w in re.findall(r'\b\w+\b', l)]
    if len(words)==0:
        average_word = 0
    else:
        # note: despite the variable name, np.max gives the length of the longest word
        average_word = np.max(words)
    return l, url_n, hasht_n, handle_n, emoj_and_such, exl_n, comma_n, dash_n\
        , a_an_n, the_n, length, average_word


In [None]:
%%time
with mp.Pool(processes= mp.cpu_count()) as p:
    total_data['tuple'] = p.map(cleaning_and_counts, total_data.tweet)

I need to distribute the created features into separate columns. I chose integer data types that are as compact as possible.
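
As an alternative to a loop over tuple positions, the tuples can be expanded in a single step. This is only a sketch on a hypothetical miniature frame `demo` with three of the features; it is not the code used for the results in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the data: one column of (text, count, count) tuples.
demo = pd.DataFrame({'tuple': [('hello', 1, 0), ('bye', 0, 3)]})
names = ['cleaned_tweet', 'url_n', 'hasht_n']

# Build all feature columns at once from the list of tuples.
unpacked = pd.DataFrame(demo['tuple'].tolist(), columns=names, index=demo.index)

# Downcast the count columns. uint8 holds only 0..255,
# so any count that can exceed 255 needs a wider type.
unpacked = unpacked.astype({name: np.uint8 for name in names[1:]})

demo = demo.join(unpacked)
print(demo.dtypes)
```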

In [None]:
features = ("cleaned_tweet, url_n, hasht_n, handle_n, emoji_and_such, exl_n, " +
    "comma_n, dash_n, a_an_n, the_n, length, average_word").split(', ')

for i in range(len(features)):
    if i ==0:
        total_data[features[i]] = total_data.tuple.apply(lambda t: t[i])
    else:
        total_data[features[i]] = total_data.tuple.apply(lambda t: t[i]).astype(np.uint8)

Let us check the results and do some memory cleaning.

In [None]:
print(total_data.columns)
total_data.drop(['tuple'], axis=1,inplace=True)
del full_ru_trolls_en, celebs, sentiment140
gc.collect()

I would like to drop accounts which have fewer than 10 posts, because I analyze account activity, and a single post does not give much information about it.
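
The same kind of filter can be written compactly with `groupby(...).transform('size')`. The toy frame `tweets` and the threshold name `min_posts` below are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data; min_posts plays the role of the minimal post count.
tweets = pd.DataFrame({'account': ['a', 'a', 'a', 'b', 'c', 'c'],
                       'tweet': ['t1', 't2', 't3', 't4', 't5', 't6']})
min_posts = 2

# transform('size') returns the group size aligned with every original row.
posts_per_account = tweets.groupby('account')['tweet'].transform('size')

# Keep only rows that belong to sufficiently active accounts.
kept = tweets[posts_per_account >= min_posts]
print(kept.account.unique())
```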

In [None]:
min_count = 10
acc_properties = total_data[['account', 'troll']].groupby(['account'])\
    .agg(tweet_count=('account', 'size'),troll = ('troll','min'))\
    .reset_index()
kept_accs = acc_properties[acc_properties.tweet_count >= min_count]
restricted = total_data[total_data.account.isin(kept_accs.account)].copy(deep=True)
del total_data
num_cols = features[1:]
restricted.drop(['tweet', 'cleaned_tweet'], axis=1, inplace=True)

The function below computes percentiles for each count variable. I wish I could add it to the sklearn Pipeline, but aggregation changes the number of rows: the estimator uses only the provided `y`, and the number of rows in `X` must equal the length of `y`, otherwise the Pipeline does not work.

After the percentiles are computed, the whole dataset is shuffled.
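
For a single count column, per-account percentiles can also be obtained in one call by passing a list of quantiles and unstacking the result. This is a sketch on a made-up frame `toy`; the column naming mimics the `name_percentile` scheme but is otherwise an assumption.

```python
import pandas as pd

# Hypothetical toy data with one count feature.
toy = pd.DataFrame({'account': ['a'] * 5 + ['b'] * 5,
                    'url_n': [0, 1, 2, 3, 4, 10, 10, 10, 10, 10]})
quantiles = [0.1, 0.5, 0.9]

# With a list of quantiles the result is indexed by (account, quantile);
# unstack() spreads the quantile level into columns.
wide = toy.groupby('account')['url_n'].quantile(quantiles).unstack()

# Rename the columns 0.1, 0.5, 0.9 to url_n_10, url_n_50, url_n_90.
wide.columns = [f'url_n_{int(round(q * 100))}' for q in wide.columns]
print(wide)
```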

In [None]:
%%time
def percentile_calc(data, groupby_col, num_cols, percentile_list):
    non_numeric = [col_name for col_name in data.columns if col_name not in num_cols]
    for qu in percentile_list: 
        percentiles = data.groupby(groupby_col).quantile(q=qu/100).reset_index()
        cols_to_change = {col : col +'_' + str(qu) for col in num_cols}
        percentiles.rename(columns=cols_to_change, inplace=True)
        if qu == percentile_list[0]:
            all_percentiles = percentiles
        else:
            all_percentiles = pd.merge(all_percentiles, percentiles, how = "left",\
                                       on = non_numeric)
    return all_percentiles

all_percentiles = percentile_calc(restricted[['account', 'troll']+num_cols], \
                                 groupby_col='account', num_cols=num_cols,
                                 percentile_list=range(10, 100, 10))
    
new_features = all_percentiles.columns[2:]
all_percentiles = shuffle(all_percentiles).reset_index(drop = True)

This is my custom scikit-learn Transformer for feature reduction using the Mutual Information method and Pearson correlation; see Part 2 for an explanation. The `X` variable must be a Pandas dataframe. For each correlated pair the Transformer drops one column and keeps the one with the higher Mutual Information. There are options for the minimal MI score, the maximum correlation value and the `n_neighbors` parameter of the MI method.
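
The correlation step can be illustrated in isolation. In the sketch below the relevance scores are hard-coded stand-ins for Mutual Information values, and the function name `greedy_uncorrelated` and the random toy data are made up; it shows only the greedy idea, not the Transformer itself.

```python
import numpy as np
import pandas as pd

def greedy_uncorrelated(data, scores, max_corr=0.7):
    """Walk through features in descending score order and skip any feature
    whose absolute Pearson correlation with a kept one exceeds max_corr."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    corr = data[ordered].corr().abs()
    kept = []
    for col in ordered:
        if all(corr.loc[col, k] <= max_corr for k in kept):
            kept.append(col)
    return kept

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({'f1': x,
                    'f2': 2 * x + 0.01 * rng.normal(size=200),  # near-duplicate of f1
                    'f3': rng.normal(size=200)})                # independent of f1
# Hypothetical relevance scores standing in for mutual information values.
toy_scores = {'f1': 0.30, 'f2': 0.25, 'f3': 0.10}

selected = greedy_uncorrelated(toy, toy_scores)
print(selected)
```

Because `f2` is almost a rescaled copy of `f1` and has the lower score, it is the one that gets dropped.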

In [None]:
def _drop_correlated(data, score_ordered_cols, max_corr, method='pearson'):
    new = [[score_ordered_cols[0]], [0]]
    corr_matrix = data[score_ordered_cols].corr(method).values
    N = len(score_ordered_cols)
    for i in range(1, N):
        tr = corr_matrix[new[1], i]
        if sum(np.abs(tr) > max_corr) == 0:
            new[0] += [score_ordered_cols[i]]
            new[1] += [i]
    return new[0]


class feature_reduction(BaseEstimator, TransformerMixin):
    def __init__(self, min_mi=.001, max_corr=.7, \
                 n_neighbors=11):
        self.min_mi = min_mi
        self.max_corr = max_corr
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        X = X.copy(deep=True)
        columns = X.columns
        mi = mutual_info_classif(X.values, y, n_neighbors= self.n_neighbors)
        cols_mi = list(zip(columns, mi))
        cols_mi.sort(reverse=True, key=lambda x: x[1])
        cols_mi = [pair[0] for pair in cols_mi if \
                   pair[1] > self.min_mi]
        new_cols = _drop_correlated(X[cols_mi], cols_mi, \
                                   max_corr=self.max_corr)
        self.selected_cols = new_cols
        return self

    def transform(self, X, y=None):
        return X[self.selected_cols]

Here are the functions for printing my Grid Search results.

In [None]:
def searchcv_best(estimator):
    print(" The best AUC score on a train set is:\n",
          estimator.best_score_)
    print(" The best parameters are:\n",
          estimator.best_params_)
def scores(y_test, y_pred):
    print('Here go metrics on a test set.')
    print("A confusion matrix is:")
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    # roc_auc below is computed from hard class labels, so it reflects a single threshold
    print('accuracy:', accuracy_score(y_test, y_pred), 
          '\nf1:', f1_score(y_test, y_pred),
          '\nprecision:', precision_score(y_test, y_pred), 
          '\nrecall:', recall_score(y_test, y_pred), 
          '\nroc_auc:', roc_auc_score(y_test, y_pred))

And here is a function for the Pipeline with Grid Search. I decided to apply the CatBoost method, which was created in Russia, because I find this ironic.

In [None]:
def train_pipe_cv(data, columns, parameters, fr_params, cbc_params,\
                  cv=5, train_size=.8, scoring='roc_auc'):
    X_train, X_test, y_train, y_test = train_test_split(data[columns], \
            data['troll'], train_size = train_size, stratify = data['troll'])
    fr = feature_reduction(**fr_params)
    cbc = CatBoostClassifier(**cbc_params)
    pipe = Pipeline([('mi_calc', fr), ('catboost', cbc)])
    search= GridSearchCV(pipe, param_grid=parameters, cv = cv, \
                         scoring= scoring,\
                         n_jobs=-1, verbose=1)
    search.fit(X_train, y_train)
    searchcv_best(search)
    y_pred = search.predict(X_test)
    scores(y_test, y_pred)
    return search.best_params_

The training takes several minutes.

In [None]:
%%time
pipe_params = {'mi_calc__max_corr': [0.7, 0.75],
               'mi_calc__n_neighbors': [5, 19],
               'catboost__depth': [9, 11]}

fr_params = {'min_mi': 0.0005}
cbc_params = {'loss_function': 'Logloss', 'logging_level': 'Silent',\
              'rsm': 0.25, 'l2_leaf_reg': 4, 'custom_metric': 'AUC'}

cv_best_params = train_pipe_cv(all_percentiles, new_features,\
                 parameters=pipe_params, fr_params=fr_params,\
                 cbc_params=cbc_params, scoring ='roc_auc')

We got a bit of overfitting here, but the test set accuracy is good nevertheless. 

I would like to note here that the MI method yields one score over all values of a variable, while a decision tree splits the range of a variable and considers its behavior on a part of it. So a feature may have a low MI score and still be helpful for fine distinctions between our cases.

To determine which variables were most useful we can check the feature importances of the trained model. Note that the CatBoost method is randomized, so each run may give a slightly different order. First I need to merge the newly learned parameters into the previous ones so that I can supply them to my Transformer and the CatBoost Estimator.
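
Grid Search reports the best parameters under keys of the form `step__param`, where `step` is the name of the Pipeline step. A small sketch of splitting such keys per step follows; the dictionary `best` is a made-up example, not an actual result of this notebook.

```python
# Hypothetical best_params_ as GridSearchCV reports them for a two-step Pipeline.
best = {'mi_calc__max_corr': 0.7, 'mi_calc__n_neighbors': 5, 'catboost__depth': 9}

per_step = {}
for key, value in best.items():
    # 'step__param' is the Pipeline naming scheme; split on the first '__' only
    step, param = key.split('__', 1)
    per_step.setdefault(step, {})[param] = value

print(per_step)
```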

In [None]:
for key in cv_best_params:
    if key[:3] == 'cat':
        cbc_params[key.split('__')[1]] = cv_best_params[key]
    else:
        fr_params[key.split('__')[1]] = cv_best_params[key]
        
fr_cols = feature_reduction(**fr_params)
fr_cols.fit(all_percentiles[new_features], all_percentiles.troll)
new_cols = fr_cols.selected_cols

Here is a list of features ordered by the importance computed for this run. They permute a little with each re-run, although the top ones tend to stay near the top of the list.

In [None]:
cbc_best = CatBoostClassifier(**cbc_params)
cbc_best.fit(all_percentiles[new_cols], all_percentiles.troll.values)
var_importance = list(zip(cbc_best.feature_importances_, new_cols))
var_importance.sort(key=lambda x:x[0], reverse=True)
print("The number of variables is "+str(len(new_cols)))
var_importance

As we see, what matters most for troll detection is the way trolls stick to their instructions. In particular, they are supposed to include specific words in their texts, so the average word length differs from what ordinary Americans use. They use a number of ways to increase their audience (links, hashtags, Twitter handles), and it shows. They apparently have guidelines for post length as well. I find this convenient, because we can set up filters to catch the most significant metrics and then check the whole activity of an account.

In addition, the suggested features turned out to depend not so much on language as on the activity of a paid troll account. Thus we can apply this work to other languages and not limit ourselves to Russian trolls posting English texts.