# File 05: Preprocessing User Timeline DataFrame

This notebook preprocesses the user timeline tweets so that they are in the format the model expects when we run predictions. Each tweet is cleaned (tags, mentions, URLs, non-letter characters, stopwords, and single-character tokens are removed), and any tweet left with fewer than 3 words is dropped. We also limit the dataset to users who have between 100 and 200 tweets, since that range should give the model enough text per user to predict more accurately.

### Input Files:
- 03-user-tweets-english-only.csv

### Output Files:
- 05-shortlisted-tweets.csv
- 05-shortlisted-usernames.csv

### Steps:
1. Load the required libraries
1. Read the user timeline tweets from the CSV file
1. Create the functions that preprocess the dataset
1. Keep only the tweets that have more than 2 words
1. Make a list of all usernames
1. Count the tweets by each user
1. Shortlist users with a tweet count between 100 and 200
1. Build the final lists of tweets and users
1. Create the DataFrames
1. Save the DataFrames

In [1]:
# loading required libraries

import pandas as pd
import numpy as np
from tqdm import tqdm
import re

In [2]:
# read user timeline tweets from the CSV file into a dataframe

df = pd.read_csv('db/03-user-tweets-english-only.csv')
user = df.USER.values.tolist()
tweet = df.TWEET.values.tolist()

In [3]:
# create functions that will preprocess the dataset

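# Compiled once at module level instead of on every call.
TAG_RE = re.compile(r'<[^>]+>')

def remove_tags(text):
    # Strip HTML-like tags such as <b> or </a>.
    return TAG_RE.sub('', text)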

stopwords = set(["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "aren't", "as", "at", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can't", "cannot", "could", "couldn't", "did", "didn't", "do", "does", "doesn't", "doing", "don't", "down", "during", "each", "few", "for", "from", "further", "had", "hadn't", "has", "hasn't", "have", "haven't", "having", "he", "he'd", "he'll", "he's", "her", "here", "here's", "hers", "herself", "him", "himself", "his", "how", "how's", "i", "i'd", "i'll", "i'm", "i've", "if", "in", "into", "is", "isn't", "it", "it's", "its", "itself", "let's", "me", "more", "most", "mustn't", "my", "myself", "no", "nor", "not", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "same", "shan't", "she", "she'd", "she'll", "she's", "should", "shouldn't", "so", "some", "such", "than", "that", "that's", "the", "their", "theirs", "them", "themselves", "then", "there", "there's", "these", "they", "they'd", "they'll", "they're", "they've", "this", "those", "through", "to", "too", "under", "until", "up", "very", "was", "wasn't", "we", "we'd", "we'll", "we're", "we've", "were", "weren't", "what", "what's", "when", "when's", "where", "where's", "which", "while", "who", "who's", "whom", "why", "why's", "with", "won't", "would", "wouldn't", "you", "you'd", "you'll", "you're", "you've", "your", "yours", "yourself", "yourselves"])

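def remove_single_chars(text):
    # Drop tokens that are only one character long.
    return " ".join(w for w in text.split() if len(w) > 1)

def remove_stopwords(text):
    # Note: the comparison is case-sensitive and the text is never lowercased,
    # so capitalized stopwords such as "The" are kept.
    return " ".join(word for word in text.split() if word not in stopwords)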

def preprocess_text(sen):
    sentence = remove_tags(sen)                                   # HTML-like tags
    sentence = re.sub(r'@[A-Za-z]+[A-Za-z0-9_-]+', '', sentence)  # @mentions
    sentence = re.sub(r'http\S+', '', sentence)                   # URLs
    sentence = re.sub(r'[^a-zA-Z]', ' ', sentence)                # keep letters only
    sentence = re.sub(r'\b\S\s\b', '', sentence)                  # single character before a word
    sentence = re.sub(r'\s+[a-zA-Z]\s+', ' ', sentence)           # isolated single letters
    sentence = re.sub(r'\s+', ' ', sentence)                      # collapse whitespace
    sentence = remove_stopwords(sentence)
    sentence = remove_single_chars(sentence)
    return sentence
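
To see what these helpers do to a raw tweet, here is a minimal, self-contained sketch. `demo_preprocess`, `DEMO_STOPWORDS`, and the sample tweets are hypothetical names introduced for illustration only; the stopword set is a tiny subset of the list above, and the regex steps are condensed rather than copied line for line.

```python
import re

# Hypothetical, abbreviated stopword set for illustration only;
# the notebook uses a much longer list.
DEMO_STOPWORDS = {"the", "is", "at", "this", "a", "and", "to"}

def demo_preprocess(text):
    text = re.sub(r'<[^>]+>', '', text)                   # strip HTML-like tags
    text = re.sub(r'@[A-Za-z]+[A-Za-z0-9_-]+', '', text)  # strip @mentions
    text = re.sub(r'http\S+', '', text)                   # strip URLs
    text = re.sub(r'[^a-zA-Z]', ' ', text)                # keep letters only
    # Drop stopwords (case-sensitive) and single-character tokens.
    words = [w for w in text.split()
             if w not in DEMO_STOPWORDS and len(w) > 1]
    return " ".join(words)

sample = "<b>Look</b> @some_user this is great!! https://example.com #NLP 2021"
cleaned = demo_preprocess(sample)
print(cleaned)  # Look great NLP

# Because nothing is lowercased, a capitalized stopword survives.
capitalized = demo_preprocess("The cat is here")
print(capitalized)  # The cat here
```

If case-insensitive stopword removal is wanted, lowercasing the text before the stopword check would be one option, at the cost of changing which tweets pass the word-count filter.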

In [4]:
# selecting only the tweets which have more than 2 words.

user_new = []
tweet_new = []

for index in tqdm(range(len(df))):
    text = preprocess_text(tweet[index])
    if len(text.split()) > 2 :
        user_new.append(user[index])
        tweet_new.append(text)

100%|████████████████████████████████████████████████████████████████████| 953285/953285 [00:19<00:00, 48498.17it/s]
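
The filter above keeps a tweet only if at least three words survive cleaning. Below is a minimal sketch of the same check written with `zip` instead of index-based access. The `*_demo` and `*_kept` names and the toy data are made up for illustration, and the already-cleaned strings stand in for the output of `preprocess_text`.

```python
# Hypothetical toy data; the strings stand in for already-cleaned tweets.
user_demo = ["alice", "bob", "carol"]
cleaned_demo = ["good morning", "love sunny weather", "great match last night"]

# Keep (user, tweet) pairs whose cleaned text has more than 2 words.
kept = [(u, t) for u, t in zip(user_demo, cleaned_demo) if len(t.split()) > 2]
user_kept = [u for u, _ in kept]
tweet_kept = [t for _, t in kept]

print(user_kept)   # ['bob', 'carol']
print(tweet_kept)  # ['love sunny weather', 'great match last night']
```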


In [6]:
# making a list of all usernames

username = []
for x in tqdm(range(len(user_new))):
    if user_new[x] not in username :
        username.append(user_new[x])

100%|████████████████████████████████████████████████████████████████████| 868770/868770 [00:33<00:00, 26087.15it/s]
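
The `not in username` test scans the list on every iteration, which likely accounts for most of the half minute this cell takes. As a hedged alternative, `dict.fromkeys` gives the same first-seen order in a single pass. `user_demo` below is toy data, not the notebook's.

```python
# Hypothetical toy data standing in for `user_new`.
user_demo = ["alice", "bob", "alice", "carol", "bob"]

# Dicts preserve insertion order (Python 3.7+), so this keeps the
# first occurrence of each username and drops later repeats.
unique_users = list(dict.fromkeys(user_demo))
print(unique_users)  # ['alice', 'bob', 'carol']
```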


In [15]:
# counting tweets by each user

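# Counts are taken over the original `user` list (all tweets), not `user_new`.
np_user = np.array(user)
tweetcount = []

for searchval in tqdm(username):
    # Number of rows whose user equals the current username.
    tweetcount.append(int(np.count_nonzero(np_user == searchval)))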

100%|██████████████████████████████████████████████████████████████████████████| 9159/9159 [00:50<00:00, 181.71it/s]
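
Note that the cell above counts over `user` (every tweet read from the CSV) rather than `user_new` (tweets kept after the word filter), so a shortlisted user can end up with fewer tweets in the final output than the count suggests. The sketch below, on made-up data, shows the difference and uses `collections.Counter` to count all users in one pass instead of scanning the array once per username. All `*_demo` names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: `user_all_demo` plays the role of `user` (every tweet),
# `user_kept_demo` the role of `user_new` (tweets that passed the word filter).
user_all_demo = ["alice", "alice", "alice", "bob", "bob"]
user_kept_demo = ["alice", "alice", "bob", "bob"]

counts_all = Counter(user_all_demo)
counts_kept = Counter(user_kept_demo)

# Unique usernames in first-seen order, then one count per username.
username_demo = list(dict.fromkeys(user_kept_demo))
tweetcount_all = [counts_all[name] for name in username_demo]
tweetcount_kept = [counts_kept[name] for name in username_demo]

print(tweetcount_all)   # [3, 2]
print(tweetcount_kept)  # [2, 2]
```

Which of the two counts is the right one depends on whether the 100 to 200 limit is meant to apply to raw or to cleaned tweets.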


In [16]:
# shortlisting users with tweet count between 100 and 200

shortlist = []
for x in tqdm(range(len(username))) :
    if (tweetcount[x] >= 100) and (tweetcount[x] <= 200) :
        shortlist.append(username[x])

100%|██████████████████████████████████████████████████████████████████████| 9159/9159 [00:00<00:00, 3053221.30it/s]


In [19]:
# making final list of tweets and users

final_user = []
final_tweet = []
shortlist_set = set(shortlist)  # set lookups avoid scanning the list each time

for x in tqdm(range(len(user_new))):
    if user_new[x] in shortlist_set:
        final_user.append(user_new[x])
        final_tweet.append(tweet_new[x])

100%|████████████████████████████████████████████████████████████████████| 868770/868770 [00:22<00:00, 38502.17it/s]
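
The shortlisting and final filtering can also be expressed directly in pandas. This is a sketch under stated assumptions, not the notebook's method: the toy frame, the `*_demo` names, and the `LOW`/`HIGH` bounds (2 and 3 instead of 100 and 200) are invented for illustration, and the counts here are taken over the rows of the frame itself.

```python
import pandas as pd

# Hypothetical toy data; the notebook works on roughly 870k cleaned tweets.
demo = pd.DataFrame({
    "USER": ["alice", "alice", "bob", "carol", "carol", "carol", "carol"],
    "TWEET": ["t1", "t2", "t3", "t4", "t5", "t6", "t7"],
})
LOW, HIGH = 2, 3  # toy bounds; the notebook uses 100 and 200

# Tweets per user, then keep users whose count lies in [LOW, HIGH].
counts = demo["USER"].value_counts()
shortlist_demo = counts[counts.between(LOW, HIGH)].index.tolist()

# Keep only the rows that belong to shortlisted users.
final_demo = demo[demo["USER"].isin(shortlist_demo)].reset_index(drop=True)

print(shortlist_demo)               # ['alice']
print(final_demo["TWEET"].tolist())  # ['t1', 't2']
```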


In [20]:
# creating dataframes

final = pd.DataFrame(list(zip(final_user, final_tweet)), columns=['USER', 'TWEET'])
username = pd.DataFrame(shortlist, columns=['USER'])

In [27]:
# saving dataframes

final.to_csv('db/05-shortlisted-tweets.csv', index=False)
username.to_csv('db/05-shortlisted-usernames.csv', index=False)