# File 03: Preprocessing Timeline Tweets

In this file we filter out all the tweets that are not in English. Fortunately or unfortunately, while fetching tweets in the last file, we collected around 1.4 million tweets. First we remove all the duplicate tweets, which brings the number down to around 1.17 million. That is still a lot of tweets to process.

Looking at the task manager, we noticed that we were not utilizing all the resources available on our machine: out of 24 available threads on our main machine, only 4 were being used. To take advantage of the rest of the threads, we use the multiprocessing facilities available in Python. Python's 'concurrent.futures' library greatly simplifies running a function across multiple processes or threads. Given a function and a list of inputs, it handles the management of the workers on its own.
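
As a minimal sketch of that idea (not the code used later in this file), the snippet below hands a function and a list of inputs to an executor. The names `square` and `run_parallel` are hypothetical, and the sketch defaults to a thread pool so that it runs anywhere; `concurrent.futures.ProcessPoolExecutor` can be passed instead to use separate processes.

```python
import concurrent.futures


def square(n):
    """Toy stand-in for an expensive per-item function (hypothetical)."""
    return n * n


def run_parallel(func, inputs, executor_cls=concurrent.futures.ThreadPoolExecutor):
    """Apply func to every input using a pool of workers."""
    # The executor creates and manages the workers; we only supply
    # the function and the inputs.
    with executor_cls() as executor:
        # map() yields results in the same order as the inputs
        return list(executor.map(func, inputs))


if __name__ == "__main__":
    print(run_parallel(square, [1, 2, 3, 4]))
```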

There was one limitation to this technique, though. In our setup, multiprocessing did not allow us to return an array of indexes for the tweets that were in English. To work around this, we save the indexes in text files, then load them later and stitch together another dataframe containing only English tweets. This dataframe is stored as '03-user-tweets-english-only.csv'.



### Input File:
- 02-user-tweets.csv ------------------> Contains: Usernames, Tweets(From Timeline)

### Output File:
- 03-user-tweets-english-only.csv -----> Contains: Username, Tweets(From Timeline, only English)

### Steps:
1. Load '02-user-tweets.csv' dataframe
2. Make a function that creates pairs of start and end points, dividing the dataframe into smaller chunks
3. Create a function 'process' that:
    1. Loads tweets from the dataframe within a certain provided range
    2. Checks the tweets for their language
    3. Keeps only the indexes of tweets which are in English
    4. Saves the array as a '.txt' file in the 'dataframe' directory
4. Use multiprocessing to process all the tweets in the dataframe using the 'process' function
5. Load all files from the 'dataframe' directory
6. Read the indexes of all the English tweets from the saved files
7. Save all of these 'tweet + username' pairs in another dataframe
8. Save the new dataframe as '03-user-tweets-english-only.csv'
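
Steps 3 to 7 amount to a round trip through files: each chunk writes its list of matching indexes, and a later pass reads every file back and selects the matching rows. The following is a small sketch of that round trip on toy data, assuming pickled lists as the file contents. The helper names, the temporary directory and the sample rows are hypothetical and only serve the illustration.

```python
import glob
import os
import pickle
import tempfile


def save_indexes(directory, chunk_id, indexes):
    """Write one chunk's list of indexes to '<directory>/<chunk_id>.txt'."""
    path = os.path.join(directory, f"{chunk_id}.txt")
    with open(path, "wb") as fo:
        pickle.dump(indexes, fo)
    return path


def load_all_indexes(directory):
    """Read every saved chunk file and return one sorted list of indexes."""
    merged = []
    for path in glob.glob(os.path.join(directory, "*.txt")):
        with open(path, "rb") as fo:
            merged.extend(pickle.load(fo))
    return sorted(merged)


def demo():
    # Toy rows standing in for the (user, tweet) table (hypothetical values).
    rows = [["a", "hello"], ["b", "hola"], ["c", "good morning"], ["d", "bonjour"]]
    with tempfile.TemporaryDirectory() as tmp:
        save_indexes(tmp, 1, [2])  # chunk 1 kept row 2
        save_indexes(tmp, 0, [0])  # chunk 0 kept row 0
        keep = load_all_indexes(tmp)
    # Select only the rows whose indexes were saved
    return [rows[i] for i in keep]


if __name__ == "__main__":
    print(demo())
```

Sorting the merged indexes keeps the selected rows in their original order, regardless of the order in which the chunk files are read.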

In [2]:
# Loading all required libraries

import concurrent.futures
import math
import pandas as pd
from tqdm import tqdm
from collections import Counter
import langid
from nltk.classify.textcat import TextCat
from langdetect import detect
import pickle
import glob
import re

In [14]:
# load '02-user-tweets.csv' dataframe

df = pd.read_csv("db/02-user-tweets.csv")
df = df.drop_duplicates()
df1 = pd.DataFrame(columns=['user', 'tweet'])
database = df.values.tolist()
output = []

In [15]:
len(df)

1171489

In [3]:
# create a function to split data into small chunks
def make_pairs(end=1400000, divs=100000):
    output = []
    var = 0
    while var < end:
        output.append([var, var + divs])
        var = var + divs

    # the last chunk may overshoot 'end'
    if output and output[-1][1] > end:
        if len(output) > 1:
            # fold the short remainder into the previous chunk
            output.pop()
        output[-1][1] = end

    return output
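
For comparison, here is a hedged sketch of another common way to compute chunk boundaries. Unlike the function above, which folds a short remainder into the previous chunk, this variant keeps the remainder as its own shorter chunk. The name `chunk_bounds` is hypothetical and is not used elsewhere in this file.

```python
def chunk_bounds(total, size):
    """Return [start, end) pairs that cover range(total) in steps of size."""
    if size <= 0:
        raise ValueError("size must be positive")
    # min() clamps the final chunk so it never runs past 'total'
    return [[start, min(start + size, total)] for start in range(0, total, size)]


if __name__ == "__main__":
    print(chunk_bounds(25, 10))
```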

In [10]:
# divide entire 'database' into chunks of size 15000
div_size = 15000
PAIRS = make_pairs(len(database), div_size)
print(PAIRS)

[[0, 15000], [15000, 30000], [30000, 45000], [45000, 60000], [60000, 75000], [75000, 90000], [90000, 105000], [105000, 120000], [120000, 135000], [135000, 150000], [150000, 165000], [165000, 180000], [180000, 195000], [195000, 210000], [210000, 225000], [225000, 240000], [240000, 255000], [255000, 270000], [270000, 285000], [285000, 300000], [300000, 315000], [315000, 330000], [330000, 345000], [345000, 360000], [360000, 375000], [375000, 390000], [390000, 405000], [405000, 420000], [420000, 435000], [435000, 450000], [450000, 465000], [465000, 480000], [480000, 495000], [495000, 510000], [510000, 525000], [525000, 540000], [540000, 555000], [555000, 570000], [570000, 585000], [585000, 600000], [600000, 615000], [615000, 630000], [630000, 645000], [645000, 660000], [660000, 675000], [675000, 690000], [690000, 705000], [705000, 720000], [720000, 735000], [735000, 750000], [750000, 765000], [765000, 780000], [780000, 795000], [795000, 810000], [810000, 825000], [825000, 840000], [840000,

In [None]:
# create function to filter out all tweets which are not in English
def process(endpoints):
    # chunk number, derived from the chunk's start offset
    chunk_id = endpoints[0] // div_size

    yes = 0  # tweets detected as English
    no = 0   # everything else, including tweets the detector fails on
    array = []
    for index in range(endpoints[0], endpoints[1]):
        try:
            # database[index][1] = preprocess_text(database[index][1])
            if detect(database[index][1]) == 'en':
                yes = yes + 1
                array.append(index)
            else:
                no = no + 1
        except Exception:
            # detect() can fail, e.g. on text with no detectable features
            no = no + 1

    # print(f"{chunk_id} Processing Done : {yes} Tweets Detected")
    # note: the 'dataframe' directory must already exist
    with open(f'dataframe/{chunk_id}.txt', 'wb') as fo:
        pickle.dump(array, fo)
        print(f"{chunk_id}.txt Created!")

In [None]:
# function that would use multiprocessing...
def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # consume the iterator so that exceptions raised in workers surface here
        list(executor.map(process, PAIRS))

In [None]:
# function to read all saved files and combine them into one single dataframe
def not_main() :
    files = glob.glob('dataframe/*.txt')
    index = []
    for file in files:
        with open(file, 'rb') as fo:
            obj = pickle.load(fo)
            for row in obj :
                index.append(row)
    index = sorted(index)
    final = []
    for num in tqdm(index):
        final.append(database[num])
    final_db = pd.DataFrame(final, columns=['USER', 'TWEET'])
    final_db.to_csv('db/03-user-tweets-english-only.csv', index=False)

In [None]:
# executing the program
main()
not_main()