# Tweets Preprocessing
**Date:** 09/09/2020                                    
*Version: 1.0*


## 1. Introduction
This notebook extracts data from semi-structured text files. The provided dataset contains information about COVID-19 related tweets. Each text file holds information about a set of tweets, and the task is to extract that information and transform it into XML format.

## 2. Import libraries

In [1]:
import os
import re
import langid
import pandas as pd
import nltk
from nltk.tokenize import RegexpTokenizer 
from nltk.corpus import stopwords
from nltk.tokenize import MWETokenizer
from nltk.collocations import *
from itertools import chain
import itertools

## 3. Parsing Text Files

### 3.1. Examining and loading data

The first step is to identify the location of all the .txt files to be processed. They are in a folder called part1, located in the same directory as this notebook. Once the files are found, we create a list with the names of all the files to be processed.

In [2]:
filenames = [] # List to store the name of the files to be processed
for filename in os.listdir(os.path.join(os.getcwd(), 'part1/')): # Get every file in the part1 folder
    filenames.append(filename)

print(filenames[:5]) # Print the name of the first 5 files to be processed
print('\nNumber of files to be processed: '+ str(len(filenames))) # Print the number of files to be processed

['2020-03-22_1066.txt', '2020-03-22_1311.txt', '2020-03-22_1363.txt', '2020-03-22_1412.txt', '2020-03-22_1551.txt']

Number of files to be processed: 2416
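
The loop above collects every entry in the folder, whatever its type. If the folder may also hold other files (hidden files, subfolders), a filtered listing can be safer. The following is a minimal sketch under that assumption; the helper name `list_txt_files` is hypothetical and not part of the original notebook.

```python
import os

def list_txt_files(folder):
    """Return the sorted names of the .txt files directly inside `folder`."""
    return sorted(
        name for name in os.listdir(folder)
        # Keep regular files only, so a directory named 'x.txt' is skipped
        if name.endswith('.txt') and os.path.isfile(os.path.join(folder, name))
    )
```

For example, `filenames = list_txt_files(os.path.join(os.getcwd(), 'part1'))` would produce the same kind of list as the cell above, in a deterministic order.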


Next, we take a look at the information stored in these files.

In [4]:
# Print the first line of the first file in the list filenames
with open(os.path.join(os.getcwd(), 'part1/', filenames[0]), 'r', encoding="utf8") as file:
    print(file.readline().strip())



### 3.2. Parsing Tweets (JSON Format)
After extracting the text of every file, we parse it with regular expressions. Each tweet has the following attributes:

1. id: A 19-digit number.
2. text: The text of the tweet.
3. created_at: The date and time the tweet was created.

The regular expressions selected to extract this information are described below and then tested on the first file of the list filenames.

#### 3.2.1 Regular Expressions:

   - **Text:** 
   
   `"text":(.*?)(?:"created_at":"|"id":"|"errors":|"withheld":)`
   
   A capturing group that gets everything between **"text":** and the first of the following statements: **"created_at":"**, **"id":"**, **"errors":** or **"withheld":**. The lazy quantifier `?` ensures that only the text of one tweet is captured, because matching stops at the first of these statements found. The capture keeps the quotation marks around the tweet plus a few trailing characters, which are removed in a later step.
       
<img src="./images/text.png" width="500" height="250">
 
   - **Created_at:** 
   
   `"created_at":"(.*?)T`
   
   A capturing group that gets everything between **"created_at":"** and **T**. The lazy quantifier `?` makes the match stop at the first **T**, so only the date is captured. We are not interested in the time, so everything from the first **T** onwards is discarded.


<img src="./images/created_at.png" width="500" height="250">
       
   - **ID:** 
   
   `"id":"(\d{19})`
   
   A capturing group that gets the 19 digits that follow **"id":"**.

<img src="./images/id.png" width="300" height="150">
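
To see what the three expressions capture, they can be run on a small synthetic string. This is only a sketch: the sample below is invented to mimic the structure described above (the IDs, dates and the surrounding `"data"`/`"errors"` keys are assumptions, not taken from the real files).

```python
import re

# Synthetic 19-digit IDs (hypothetical values, not real tweets)
ids = ['1' + '0' * 17 + str(k) for k in (1, 2, 3)]

# Three tweets with the attributes in different orders
sample = (
    '{"data":['
    '{"id":"%s","text":"First tweet","created_at":"2020-03-22T10:15:30.000Z"},'
    '{"created_at":"2020-03-22T11:00:00.000Z","id":"%s","text":"Second tweet"},'
    '{"id":"%s","created_at":"2020-03-23T08:00:00.000Z","text":"Third tweet"}],'
    '"errors":[]}'
) % tuple(ids)

texts = re.findall(r'"text":(.*?)(?:"created_at":"|"id":"|"errors":|"withheld":)', sample)
dates = re.findall(r'"created_at":"(.*?)T', sample)
found_ids = re.findall(r'"id":"(\d{19})', sample)

for raw_text, date, tweet_id in zip(texts, dates, found_ids):
    print(tweet_id, date, raw_text)
```

Each captured text still carries its quotation marks and a different tail (`,`, `},{` or `}],`), depending on where the attribute sits in the JSON object.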


#### 3.2.2 Fixing text of the tweets

##### 3.2.2.1 Strip special combinations of characters
After parsing the text of the tweets with the regular expression `"text":(.*?)(?:"created_at":"|"id":"|"errors":|"withheld":)`, we have to remove the characters that follow the closing quotation mark of the tweet. There are three possible combinations:

> **",**  |
If the text is the first or the second attribute of the JSON object, as in {"text":"*tweet*","created_at"} or {"text":"*tweet*","id"}, the trailing **,** has to be removed.

> **"},{**  |
If the text is the last attribute of the JSON object, as in {"created_at","text":"*tweet*"} or {"id","text":"*tweet*"}, and another JSON object follows, the trailing **},{** has to be removed.

> **"}],**  |
If the text is the last attribute of the JSON object, as in {"created_at","text":"*tweet*"} or {"id","text":"*tweet*"}, and no other JSON object follows, the trailing **}],** has to be removed.
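
The three scenarios above can be checked with a short sketch. It relies on the fact that `str.rstrip` treats its argument as a set of characters to remove, not as a literal suffix, so stripping stops at the closing quotation mark. The helper name `strip_json_tail` and the sample fragments are hypothetical.

```python
def strip_json_tail(fragment):
    """Remove the JSON punctuation left after the closing quote of a tweet."""
    fragment = fragment.rstrip(',')    # Scenario 1: trailing comma
    fragment = fragment.rstrip('},{')  # Scenario 2: any run of '}', ',' and '{'
    fragment = fragment.rstrip('}]')   # Scenario 3: any run of '}' and ']'
    return fragment

for fragment in ['"tweet",', '"tweet"},{', '"tweet"}],']:
    print(repr(fragment), '->', repr(strip_json_tail(fragment)))
```

Because the stripping is character-based, the order of the three calls matters: the comma of scenario 3 has to go first so that the final call can reach the `}]` behind it.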

##### 3.2.2.2 Escaping special XML characters:

Some special characters must be replaced by their escaped form to keep the text valid in XML. They are listed in the following table:

|Special character|Character|Escaped form|
|:--:|:--:|:--:|
|Ampersand|`&`|`&amp;`|
|Less-than|`<`|`&lt;`|
|Greater-than|`>`|`&gt;`|
|Quotes|`"`|`&quot;`|
|Apostrophe|`'`|`&apos;`|
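
The same replacements are available in the standard library. The sketch below uses `xml.sax.saxutils.escape`, which handles `&`, `<` and `>` by itself and accepts a dictionary for additional characters; the wrapper name `escape_xml` is hypothetical, and this is an alternative to chained `str.replace` calls, not the approach used in this notebook.

```python
from xml.sax.saxutils import escape

def escape_xml(text):
    """Escape the five special XML characters listed in the table."""
    # escape() replaces '&' first, so the entities it adds are not escaped twice
    return escape(text, {'"': '&quot;', "'": '&apos;'})

print(escape_xml('Tom & "Jerry" <3'))
```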


##### 3.2.2.3 Surrogate Pairs:
The text has to be processed as follows to recover it exactly as it was posted, instead of getting escape codes for emojis or blank spaces.
```python
    eval(string).encode('utf-16', 'surrogatepass').decode('utf-16')
```
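
A small worked example may make this step clearer. It assumes that an emoji is stored in the file as two `\u` escape codes (a surrogate pair); the sample string is invented. `ast.literal_eval` is used here as a safer stand-in for `eval`, and `json.loads` is shown as an alternative that merges surrogate pairs on its own; neither is the method used in this notebook.

```python
import ast
import json

# Quoted text as it might appear in a file: backslashes are literal characters
raw = '"\\ud83d\\udc47 look"'

# Step 1: turn the quoted literal into a string (two lone surrogates + text)
unescaped = ast.literal_eval(raw)
# Step 2: round-trip through UTF-16 so the two surrogates merge into one emoji
merged = unescaped.encode('utf-16', 'surrogatepass').decode('utf-16')

# Alternative: the JSON decoder resolves the escapes and the pair in one call
decoded = json.loads(raw)

print(len(unescaped), len(merged), merged == decoded)
```

The two routes can differ on inputs that are valid Python literals but not valid JSON strings (or the reverse), so the choice depends on how the files were written.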


The text parsed from the .txt files needs the three fixes described above, so a function is created to apply them:

In [4]:
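def fix_text(raw):
    # Remove the special combinations of characters at the end of the text.
    # rstrip() removes any trailing characters that belong to the given set.
    raw = raw.rstrip(',') # Scenario 1
    raw = raw.rstrip('},{') # Scenario 2
    raw = raw.rstrip('}]') # Scenario 3
    
    # Surrogate pairs: evaluate the quoted string, then merge the pairs via UTF-16.
    # Note: eval() executes its input, so it must only be used on trusted files.
    fixed = eval(raw).encode('utf-16', 'surrogatepass').decode('utf-16')
    
    # Escape the special XML characters (ampersand first, so that the
    # entities added afterwards are not escaped a second time)
    fixed = fixed.replace("&", "&amp;")
    fixed = fixed.replace("<", "&lt;")
    fixed = fixed.replace(">", "&gt;")
    fixed = fixed.replace('"', "&quot;")
    fixed = fixed.replace("'", "&apos;")

    return fixed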

In [5]:
# Parse the first file in the list filenames and print the information of the first tweets
with open(os.path.join(os.getcwd(), 'part1/', filenames[0]), 'r', encoding="utf8") as file: # Open the file that is in a folder called part1
    file_text = file.read() # Get the text of the file
    text = re.findall(r'"text":(.*?)(?:"created_at":"|"id":"|"errors":|"withheld":)', file_text) # Get the text after "text": and before "created_at":", "id":", "errors": or "withheld":
    created_at = re.findall(r'"created_at":"(.*?)T', file_text) # Get the text after "created_at":" and before T
    id_ = re.findall(r'"id":"(\d{19})', file_text) # Get the 19 digits after "id":"
    
# Print the parsed information of the first 3 tweets to check that the regular expressions work properly
for i in range(3):
    tweet = fix_text(text[i]) # Strip the trailing characters, merge surrogate pairs and escape XML characters
    print('Tweet ' + str(i+1) +':\n  Text: ' + tweet + "\n  Created at: " + created_at[i]+'\n  ID: ' + id_[i]+'\n\n')
    
print('There are ' + str(len(text)) + " tweets to be parsed in the first file of filenames")

Tweet 1:
  Text: The last line though ..what a tone.. https://t.co/TqH0DBWhB4
  Created at: 2020-03-22
  ID: 1241636537961517056


Tweet 2:
  Text: 👇 https://t.co/sQmCOVaL3l
  Created at: 2020-03-22
  ID: 1241636537990664193


Tweet 3:
  Text: @BDUTT https://t.co/Vylmpnh8c4
  Created at: 2020-03-22
  ID: 1241636538061996038


There are 84 tweets to be parsed in the first file of filenames


## 4. Filtering English Tweets
Using the `langid` library we can filter out the non-English tweets, since we are only interested in those written in English. We identify the language of each tweet and classify it accordingly. The following example identifies the language of the first 10 tweets of the first file in filenames.

In [6]:
en_tweets = []
date_en_tweets = []
id_en_tweets = []
for i in range(10):
    tweet = fix_text(text[i]) # Fix the text parsed with the XML format and encoded with utf-16
    if langid.classify(tweet)[0] == 'en':
        print('English tweet: ' + tweet + '\n')
        en_tweets.append(tweet)
        date_en_tweets.append(created_at[i])
        id_en_tweets.append(id_[i])
    else:
        print('Non-english tweet: ' + tweet + '\n')

English tweet: The last line though ..what a tone.. https://t.co/TqH0DBWhB4

Non-english tweet: 👇 https://t.co/sQmCOVaL3l

English tweet: @BDUTT https://t.co/Vylmpnh8c4

Non-english tweet: #Israel: #coronavirus
▫️945 enfermos 
▫️1 muerte

Hasta hoy, no se ha aplicado cuarentena obligatoria, se ha apelado a la responsabilidad individual, se han restringido desplazamientos y a transgresores, aplicación de fuertes multas.

#coronavirus 
#COVID19

English tweet: @TwitterSupport. This is dangerous disinformation that will lead to deaths. Please remove this tweet. https://t.co/eRHkanb4RA

English tweet: Sebi na one President be this https://t.co/zrl9zXeZ71

Non-english tweet: *cue eerie music* https://t.co/mTH0JmhayG

Non-english tweet: Che presa per il culo... #Coronavirus, ecco l’#Italia che non si ferma: quali sono i servizi essenziali - Il Sole 24 ORE @sole24ore https://t.co/T6pBLn4cHf

English tweet: This tweet didn’t age well. https://t.co/4o3WZBTULZ

English tweet: @OFB_India #JantaCu

In [7]:
print('\nOnly ' + str(len(en_tweets)) + ' tweets out of 10 are written in English')


Only 6 tweets out of 10 are written in English
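
In the output above, short tweets that consist mostly of a link or a mention are the hardest to classify (for instance, "*cue eerie music*" followed by a URL is labelled non-English). One possible refinement is to remove URLs and @mentions before classification. The sketch below is an assumption, not part of the original pipeline, and whether it improves the results on this dataset has not been checked. The names `strip_noise` and `keep_english` are hypothetical; `classify` can be any callable that returns a (language, score) pair, such as `langid.classify`.

```python
import re

URL_OR_MENTION = re.compile(r'https?://\S+|@\w+')

def strip_noise(tweet):
    """Remove URLs and @mentions, then collapse repeated whitespace."""
    cleaned = URL_OR_MENTION.sub(' ', tweet)
    return ' '.join(cleaned.split())

def keep_english(tweets, classify):
    """Return the tweets whose cleaned text is classified as 'en'.

    The original tweet is kept in the result; only the text passed
    to the classifier is cleaned.
    """
    return [t for t in tweets if classify(strip_noise(t))[0] == 'en']
```

With `langid` installed, a call would look like `keep_english(tweets, langid.classify)`.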


## 5. Store the tweets in a dictionary with XML format
After parsing all the given .txt files and filtering out the non-English tweets, we store the information in a dictionary whose keys are the dates and whose values are lists with all the tweets written on that date.
The dates and the texts stored in the dictionary have the following XML format:

#### Date:

    <tweets date="(yyyy-mm-dd)">
    
#### Text:

    <tweet id="(id)">(text of the English tweet)</tweet>

In [8]:
twetts_dict = {}
for i in range(len(en_tweets)):
    xml_date = '<tweets date="' + date_en_tweets[i] +'">'
    xml_tweet = '<tweet id="' + id_en_tweets[i] + '">' + en_tweets[i] + '</tweet>'

    if xml_date in twetts_dict:
        twetts_dict[xml_date].append(xml_tweet)
    else:
        twetts_dict[xml_date] = [xml_tweet]
twetts_dict

{'<tweets date="2020-03-22">': ['<tweet id="1241636537961517056">The last line though ..what a tone.. https://t.co/TqH0DBWhB4</tweet>',
  '<tweet id="1241636538061996038">@BDUTT https://t.co/Vylmpnh8c4</tweet>',
  '<tweet id="1241636538192023553">@TwitterSupport. This is dangerous disinformation that will lead to deaths. Please remove this tweet. https://t.co/eRHkanb4RA</tweet>',
  '<tweet id="1241636538238349318">Sebi na one President be this https://t.co/zrl9zXeZ71</tweet>',
  '<tweet id="1241636541882982400">This tweet didn’t age well. https://t.co/4o3WZBTULZ</tweet>',
  '<tweet id="1241636542067527681">@OFB_India #JantaCurfewMarch22 @narendramodi\nTo join the fight against coronavirus, I have taken the Janta Curfew Pledge and committed myself towards keeping the country safe. You can also take the pledge at https://t.co/eTCtdnmXA0 https://t.co/UDPNW1LhnN via @mygovindia</tweet>']}
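
The grouping step can also be written with `collections.defaultdict`, which removes the explicit check for an existing key. This is a sketch of an equivalent formulation under the same XML format; the function name `group_by_date` and its parameters are hypothetical.

```python
from collections import defaultdict

def group_by_date(dates, ids, texts):
    """Group XML-formatted tweets under their XML-formatted date tag."""
    grouped = defaultdict(list)  # Missing keys start as an empty list
    for date, tweet_id, text in zip(dates, ids, texts):
        xml_date = '<tweets date="' + date + '">'
        xml_tweet = '<tweet id="' + tweet_id + '">' + text + '</tweet>'
        grouped[xml_date].append(xml_tweet)
    return dict(grouped)
```

Called as `group_by_date(date_en_tweets, id_en_tweets, en_tweets)`, it would build the same kind of dictionary as the cell above. Note that `zip` stops at the shortest list, so lists of unequal length are truncated silently.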

## 6. Get code together and Store the information
The process shown above is combined into the single block of code below, which stores the parsed information in the dictionary *twetts_dict* in XML format.

In [9]:
twetts_dict = {} # Dictionary that will store the information parsed from the .txt files

for filename in os.listdir(os.path.join(os.getcwd(), 'part1/')): # Get every file in the part1 folder, next to this notebook

    with open(os.path.join(os.getcwd(), 'part1/', filename), 'r', encoding="utf8") as file: # Open the file 
        file_text = file.read() # Get the text of the file
        text = re.findall(r'"text":(.*?)(?:"created_at":"|"id":"|"errors":|"withheld":)', file_text) # Get the text after "text": and before "created_at":", "id":", "errors": or "withheld":
        created_at = re.findall(r'"created_at":"(.*?)T', file_text) # Get the text after "created_at":" and before T
        id_ = re.findall(r'"id":"(\d{19})', file_text) # Get the 19 digits after "id":"

        for i in range(len(text)): # For every match of text captured with the regular expressions above
            
            tweet = fix_text(text[i]) # Fix the text parsed with the XML format and encoded with utf-16
            if langid.classify(tweet)[0] == 'en': # If the tweet is written in English
                xml_date = '<tweets date="' + created_at[i] +'">' # XML Format of the date (key)
                xml_tweet = '<tweet id="' + id_[i] + '">' + tweet + '</tweet>' # XML Format of the id and text (Value)

                if xml_date in twetts_dict: # If the date is already stored in the dictionary
                    twetts_dict[xml_date].append(xml_tweet) # Append a new tweet related to this date
                else: 
                    twetts_dict[xml_date] = [xml_tweet] # Create a new key and value

## 7. Write XML File
Finally, with the information stored in *twetts_dict* we create an XML file called **30550971.xml** with the structure shown in the code below. The file is stored in the same folder as this notebook.

In [10]:
with open('30550971.xml', 'w', encoding='utf8') as xml_file: # The file is closed automatically at the end of the block
    xml_file.write('<?xml version="1.0" encoding="UTF-8"?>')
    xml_file.write('<data>')

    for date in twetts_dict:
        xml_file.write(date) # Opening <tweets date="..."> tag
        for tweet in twetts_dict[date]:
            xml_file.write(tweet)

        xml_file.write('</tweets>') # Close the group of tweets of this date

    xml_file.write('</data>')
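
Since the file is assembled from strings, it can be worth checking that the result is well-formed. The following sketch parses the file with the standard library and counts the tweets per date; the function name `count_tweets_per_date` is hypothetical, and the element names are the ones used in this notebook.

```python
import xml.etree.ElementTree as ET

def count_tweets_per_date(path):
    """Parse the XML file at `path` and return {date: number of tweets}.

    ET.parse raises xml.etree.ElementTree.ParseError if the file is
    not well-formed (for example, an unescaped '&' inside a tweet).
    """
    root = ET.parse(path).getroot()
    return {
        day.get('date'): len(day.findall('tweet'))
        for day in root.findall('tweets')
    }
```

For example, `count_tweets_per_date('30550971.xml')` would return one entry per date written above.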

## References

[Open files with os library](https://stackoverflow.com/questions/18262293/how-to-open-every-file-in-a-folder)

[Open only .txt files in a folder](https://stackoverflow.com/questions/3964681/find-all-files-in-a-directory-with-extension-txt-in-python)

[Classify English tweets](https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language)

[Create a new XML file](https://www.w3schools.com/python/python_file_write.asp)

[Create a new line in a XML file](https://www.kite.com/python/answers/how-to-append-a-newline-to-a-file-in-python#:~:text=Use%20file.,append%20a%20newline%20to%20file%20.)

[Escape special characters in XML](https://stackoverflow.com/questions/1546717/escaping-strings-for-use-in-xml)