## Loading Basic Data
How you load text into Python depends largely on how the corpus is stored. Common ways of storing corpora on disk are:
* Collection of text files in a folder/multiple folders
* A single file with each document in a new line
* Column(s) of an RDBMS

Let's see a couple of examples. 

### Loading a text with one document per line

In [None]:
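with open('data/tweets.txt', 'r') as f:
    tweets = f.read()
    tweets = tweets.split('\n')
    # `f.readlines()` is similar, but it keeps the trailing '\n' on every line;
    # `f.read().splitlines()` splits on line breaks and leaves no empty string after a final newline

print(tweets[:3])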
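
The three ways of splitting a file into lines differ in small but important details. The sketch below writes a tiny stand-in corpus to a temporary file (the file content and the variable names are made up for illustration) and reads it back with each approach.

```python
import os
import tempfile

# Write a tiny stand-in corpus (hypothetical content) to a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('first tweet\nsecond tweet\n')
    path = tmp.name

with open(path, 'r') as f:
    via_split = f.read().split('\n')        # leaves '' after the final newline
with open(path, 'r') as f:
    via_readlines = f.readlines()           # keeps '\n' at the end of each line
with open(path, 'r') as f:
    via_splitlines = f.read().splitlines()  # neither of the above

os.remove(path)
print(via_split, via_readlines, via_splitlines, sep='\n')
```

If the file ends with a newline, `split('\n')` produces one empty document at the end, which is worth filtering out before any further processing.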

### Loading a collection of texts along with their class

In [None]:
import glob, os

data_dir = 'data/sample_data'  # directory that holds one sub-folder per class
# keep only sub-folders; each folder name doubles as the class label
folders = sorted(name for name in os.listdir(data_dir)
                 if os.path.isdir(os.path.join(data_dir, name)))
# print(folders)

all_texts = []
all_categories = []

for folder in folders:
    print('importing text files from "{}" folder...'.format(folder), end=' ')

    # build full paths instead of changing the working directory
    files_in_folder = glob.glob(os.path.join(data_dir, folder, '*.txt'))

    for file_path in files_in_folder:
        with open(file_path, 'r') as f:
            all_texts.append(f.read())       # the document itself
            all_categories.append(folder)    # its class label

    print('found {} files'.format(len(files_in_folder)))
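
The third storage option listed earlier, a column of an RDBMS, can be sketched with the standard-library `sqlite3` module. The in-memory database, the `documents` table and its `body` and `category` columns are all hypothetical stand-ins; a real database would need its own connection details and schema.

```python
import sqlite3

# In-memory database standing in for a real RDBMS; the table and column
# names are hypothetical
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT, category TEXT)')
conn.executemany(
    'INSERT INTO documents (body, category) VALUES (?, ?)',
    [('stocks rallied today', 'business'), ('the team won the final', 'sport')],
)

# Each row is one document; pull the text column and the label column
rows = conn.execute('SELECT body, category FROM documents ORDER BY id').fetchall()
conn.close()

db_texts = [body for body, _ in rows]
db_categories = [category for _, category in rows]
print(db_texts)
print(db_categories)
```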

## Basic Preprocessing
Let's import NLTK and apply the following steps to each text in the `all_texts` list.

* Convert everything to lowercase
* Tokenize into sentences
* Word tokenize each sentence
* Remove stop words

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time setup if the NLTK data is missing (resource names can vary by NLTK version):
# nltk.download('punkt'); nltk.download('stopwords')

stopwords = nltk.corpus.stopwords
eng_stopwords = set(stopwords.words('english'))  # a set makes membership tests fast

def basic_preprocessing(text):
    """Lowercase the text, split it into sentences, tokenize each sentence and drop stop words."""
    text = text.lower()
    sentences = sent_tokenize(text)

    tokenized_sentences = []

    for sentence in sentences:
        words = word_tokenize(sentence)
        words = [word for word in words if word not in eng_stopwords]
        tokenized_sentences.append(words)
    return tokenized_sentences

## Testing
basic_preprocessing('''This is a test sentence. Second sentence is longer than the previous sentence. 
NOTICE HOW THE CAPITALS ARE CONVERTED TO LOWER CASE and how the common words will be removed''')

### List comprehension (Python detour)
Notice the statement
```python
words = [word for word in words if word not in eng_stopwords]
```
We combined `for` and `if` inside a single pair of square brackets. This is called a list comprehension. The same code could have been written as follows
```python
tmp = []
for word in words:
    if word not in eng_stopwords:
        tmp.append(word)
words = tmp
```
The general shape of a list comprehension is
```python
[f(x) for x in list]
```
This means: for every item `x` in `list`, compute `f(x)` and collect the results in a new list. In our case `f` simply returned the word itself.

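The `if` clause is optional. With a filtering condition, the statement looks like
```python
[f(x) for x in list if g(x)]
```

In our case `g(x)` was

```python
word not in eng_stopwords
```
so stop words are simply left out of the result. To keep every item but choose between two values, `f(x)` when `g(x)` holds and `h(x)` otherwise, the condition moves to the front as a conditional expression: `[f(x) if g(x) else h(x) for x in list]`. We did not need `h(x)` here because we do not want the stop words at all.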
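
A small runnable sketch may help separate the two uses of a condition in a list comprehension: a trailing `if` that filters items out, and a conditional expression that keeps every item but changes its value. The `stop_words` set and the `'<stop>'` placeholder are made up for this illustration and stand in for the NLTK stop word list.

```python
# Hypothetical mini stop word list, standing in for nltk's English list
stop_words = {'is', 'a', 'the'}
words = ['this', 'is', 'a', 'short', 'test']

# Filter: the trailing `if` keeps only items that pass the condition
kept = [word for word in words if word not in stop_words]

# Conditional expression: every item is kept, but its value depends on the condition
masked = [word if word not in stop_words else '<stop>' for word in words]

print(kept)    # ['this', 'short', 'test']
print(masked)  # ['this', '<stop>', '<stop>', 'short', 'test']
```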

In [None]:
processed_texts = [basic_preprocessing(text) for text in all_texts]
print(processed_texts[0])

### Frequency Count

In [None]:
from nltk import FreqDist
string='''
At Waterloo we were fortunate in catching a train for Leatherhead, where we hired a trap at the station inn and drove for four or five miles through the lovely Surrey lanes. 
It was a perfect day, with a bright sun and a few fleecy clouds in the heavens. 
The trees and wayside hedges were just throwing out their first green shoots, and the air was full of the pleasant smell of the moist earth. To me at least there was a strange contrast between the sweet promise of the spring and this sinister quest upon which we were engaged. 
My companion sat in the front of the trap, his arms folded, his hat pulled down over his eyes, and his chin sunk upon his breast, buried in the deepest thought. 
Suddenly, however, he started, tapped me on the shoulder, and pointed over the meadows.
'''

freqs = FreqDist(word_tokenize(string))
print(freqs)

print(freqs.most_common(10))
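
To see what a frequency count does without any NLTK data installed, here is a minimal sketch using `collections.Counter` from the standard library, which `FreqDist` builds on. The tokenizer is a naive regular expression chosen for this sketch, not `word_tokenize`, so punctuation is dropped and everything is lowercased first.

```python
import re
from collections import Counter

sample = 'It was a perfect day, with a bright sun and a few fleecy clouds in the heavens.'

# Naive tokenizer (an assumption for this sketch): lowercase, then keep runs of letters
tokens = re.findall(r'[a-z]+', sample.lower())

counts = Counter(tokens)
print(counts.most_common(3))
```

Because punctuation and case are normalised before counting, tokens such as `,` and `.` no longer compete with real words for the top spots.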

### Document Similarity

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer() #blank tfidf calculator
tfidf.fit(tweets) #learn the vocabulary and the idf weights from the tweets
train_tfidf = tfidf.transform(tweets) #create the tfidf matrix (one row per tweet)
print(train_tfidf.shape)

In [None]:
tweets[0]

Let's compare the first tweet with every tweet in the corpus and see which one it is closest to.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

for row_num, row in enumerate(train_tfidf):
    print('Similarity between tweet #{} and #{} is: {:.3f}'.format(1, row_num+1, cosine_similarity(train_tfidf[0], row)[0][0]))

It does look like tweet-5 is talking about something similar

In [None]:
tweets[0], tweets[5]
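
Cosine similarity itself is a short computation: the dot product of two vectors divided by the product of their lengths. The sketch below applies it to raw word counts rather than TF-IDF weights, so its numbers will not match scikit-learn's output; the `cosine` helper and the example sentences are hypothetical and only illustrate the formula.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between two texts using raw word counts (not TF-IDF)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)          # shared words only contribute
    norm_a = math.sqrt(sum(v * v for v in a.values()))  # vector length of a
    norm_b = math.sqrt(sum(v * v for v in b.values()))  # vector length of b
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)

print(round(cosine('banks cut costs', 'banks cut jobs'), 3))
print(cosine('banks cut costs', 'sunny weather today'))
```

Texts that share no words score 0, and a text compared with itself scores 1, which is why the first tweet always shows a similarity of 1.000 with itself.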

Let's supply our own text and compare it with the tweets. This is similar to issuing a query and retrieving the matching documents.

In [None]:
test_tfidf = tfidf.transform(['banks business process blah blah'])

In [None]:
for row_num, row in enumerate(train_tfidf):
    print('Similarity between test tweet and tweet #{} is: {:.3f}'.format(row_num+1, cosine_similarity(test_tfidf[0], row)[0][0]))
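
Instead of looping row by row, `cosine_similarity` can score a query against the whole matrix in one call, and the scores can then be sorted to rank the documents. This is a sketch on a made-up three-document corpus; `docs`, `query` and the other variable names are hypothetical and not part of the tweet data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini corpus and query
docs = [
    'banks report strong quarterly earnings',
    'new phone released with a better camera',
    'business process outsourcing helps banks cut costs',
]
query = 'banks business process'

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(docs)    # fit and transform in one step
query_vector = vectorizer.transform([query])   # reuse the fitted vocabulary

# One row of scores: the query against every document
scores = cosine_similarity(query_vector, doc_matrix).ravel()
ranking = np.argsort(scores)[::-1]             # indices from best to worst match

for idx in ranking:
    print('{:.3f}  {}'.format(scores[idx], docs[idx]))
```

Words in the query that the vectorizer never saw during fitting are ignored, so a query made only of unseen words scores 0 against every document.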

## Activity

Now that you understand loading, preprocessing and regular expressions, load the data `cognizant_tweets.txt` and do the following 
1.	Find the 20 most common words in the corpus  
    a.	Excluding the stop-words  
    b.	Including the stop-words  
2.	Find the 10 most common hash-tags in the corpus
3.	Return a corpus (list of strings) that contains no hashtags or URLs. Hint: Use the `word.startswith('string')` method to check whether a word has to be removed.  
    a.	(Regex activity - optional) In many tweets the character '&' is written as '&amp;'. Replace the latter with the former using the `re.sub` function during preprocessing.
4.	What percent of the tweets start with 'RT'?
5.  Which tweet is the 5th tweet most similar to? Print it.
