# Content
1. [Introduction](#Introduction)
1. [Initializing](#Initializing)
1. [Describing the data](#Describing-the-data)
1. [Preprocessing the text](#Preprocessing-the-text)
1. [Making ngram model](#Making-ngram-model)
1. [Making the generator](#Making-the-generator)
1. [Testing the generator](#Testing-the-generator)
1. [Conclusion](#Conclusion)


# Introduction
Hello! Here I want to show you my attempt to generate COVID tweets using **ngram models** trained on the [COVID19 Tweets](https://www.kaggle.com/gpreda/covid19-tweets) dataset.

# Initializing
In this project we will need:
* **pandas** - to read the data
* **re** - to preprocess the text 
* **defaultdict** from collections - to create ngram models 
* **random** - to generate tweets

Let's import packages and read csv:

In [None]:
import pandas as pd
import re
from collections import defaultdict
import random

df = pd.read_csv('/kaggle/input/covid19-tweets/covid19_tweets.csv')

# Describing the data
The dataset contains information about tweets and their authors. Its author, [Gabriel Preda](https://www.kaggle.com/gpreda), collected tweets with hashtags referring to COVID.

To generate tweets we will need only the "text" column.

In [None]:
df.head()

# Preprocessing the text

## Helper function
To create ngram models we need to clean the sentences. Let's create a function that takes a **"text"** parameter and returns a list of clean sentences.

It will:
1. convert the text to lowercase
1. delete links
1. delete special symbols, e.g. %, ^, & (digits and commas are removed as well)
1. remove unnecessary whitespace
1. split the tweet text into sentences at punctuation marks and line breaks

In [None]:
def preprocess_text(text: str):
    """
    Clean a tweet and split it into sentences.

    :param text: - tweet text
    :returns: - list of preprocessed sentences of a tweet.
    """

    # lowercase the text
    text = text.lower()

    # delete links: "http" or "https", "://", then everything up to the next whitespace
    # (the lazy pattern 'https.+?' would remove only "https" plus one character)
    text = re.sub(r'https?://\S+', '', text)

    # keep only latin letters, spaces, sentence punctuation (! ? .) and line breaks
    text = re.sub(r'[^a-z !?.\n]', '', text)

    # collapse runs of spaces into a single space
    text = re.sub(r' +', ' ', text)

    # remove a space before a punctuation mark, line break or end of the text
    text = re.sub(r' (\?|!|\.|\n|$)', r'\1', text)

    # remove a space at the beginning of the text or after a punctuation mark / line break
    text = re.sub(r'(\?|!|\.|\n|^) ', r'\1', text)

    # split by punctuation mark or line break
    texts = re.split(r'[?.!\n]', text)

    # drop empty sentences
    texts = [t for t in texts if t]

    return texts

### Test of preprocessing:

In [None]:
test_tweet = 'BiG, brother is 100% watching  you .Are You scared???'
preprocess_text(test_tweet)
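
To look at the link-removal step in isolation, here is a minimal sketch on a made-up tweet. The sample text and the `URL_PATTERN` name are assumptions for illustration and are not part of the notebook. The sketch treats a link as `http` or `https`, then `://`, then everything up to the next whitespace, and compares that with a lazy `.+?` pattern.

```python
import re

# Hypothetical pattern (an assumption for this sketch): scheme, "://",
# then any run of non-whitespace characters.
URL_PATTERN = re.compile(r'https?://\S+')

# Invented sample tweet.
sample = 'stay safe https://t.co/abc123 and wash your hands'

# Removing the link leaves a double space behind, which a later
# whitespace-cleanup step is expected to collapse.
without_link = URL_PATTERN.sub('', sample)
print(repr(without_link))

# A lazy quantifier such as `.+?` matches as little as possible, so on its
# own it consumes only a single character after the "https" prefix.
lazy = re.sub(r'https.+?', '', sample)
print(repr(lazy))
```

The second result still contains most of the link, which is why the pattern needs something explicit (here `\S+`) that says where a link ends.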

## Creating text corpus

In [None]:
corpus = []
for t in df['text']:
    corpus.extend(preprocess_text(t))

In [None]:
for c in corpus[:10]:
    print(c)

# Making ngram model
Here we will build the ngram model using `defaultdict`. It behaves like a normal dict, but if you access a key that does not exist yet, it creates that key for you with a default value.

The keys of our ngram dict will be tuples of tokens: N consecutive words in their original order. The values are the number of times each ngram occurs in the corpus.
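
As a small illustration of counting with `defaultdict(int)` and tuple keys, here is a sketch on an invented sentence. It uses bigrams only to keep the output short; the sentence and variable names are assumptions for this example.

```python
from collections import defaultdict

# Toy sentence, invented for illustration.
tokens = 'wear a mask and wear a shield'.split()

N = 2  # bigrams, to keep the example small
counts = defaultdict(int)

for i in range(len(tokens) - N + 1):
    key = tuple(tokens[i:i + N])  # tuples are hashable, lists are not
    counts[key] += 1              # a missing key starts at int() == 0

print(counts[('wear', 'a')])
print(len(counts))
```

Note that merely reading a missing key from a `defaultdict` also inserts it, so membership checks are safer with `key in counts` or `counts.get(key)`.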

I also wanted to set some parameters:
1. **N** - the size of the ngrams to create
1. **min_count** - excludes ngrams that occur in the text *"min_count"* times or fewer
1. **min_tokens** - ngrams are taken only from sentences with more than *"min_tokens"* words

In [None]:
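def create_ngrams(N: int, min_count: int = 5, min_tokens: int = 5):

    """
    Count ngrams in the global `corpus` list of sentences.

    :param N: ngram size
    :param min_count: ngrams found min_count times or fewer are dropped
    :param min_tokens: sentences with min_tokens words or fewer are skipped

    :returns: Dict[Tuple[str, ...], int]
    """
    ngram = defaultdict(int)

    for line in corpus:
        tokens = line.split()

        if len(tokens) > min_tokens:
            for i in range(len(tokens) - N + 1):
                ngram[tuple(tokens[i : i + N])] += 1

            # sentence start: '^' followed by the first N - 1 tokens
            ngram[tuple(['^'] + tokens[0 : N - 1])] += 1

            # sentence end: the last N - 1 tokens followed by '$'
            # (the slice tokens[-N:-1] would leave out the final word)
            ngram[tuple(tokens[max(len(tokens) - N + 1, 0) :] + ['$'])] += 1

    # keep only ngrams seen more than min_count times
    ngram = {key: value for key, value in ngram.items() if value > min_count}

    return ngram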
    

### Let's create a trigram model

In [None]:
trigrams = create_ngrams(3)

In [None]:
len(trigrams)

In [None]:
list(trigrams.items())[:10]

You may notice special symbols:
* **^** - marks the beginning of a sentence
* **\$** - marks the end of a sentence
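
To make the boundary markers concrete, here is a minimal sketch on an invented sentence. `padded_ngrams` is a hypothetical helper, not a function from this notebook: it pads the token list once on each side and slides a window over it, which is one common way to get ngrams that carry start and end markers.

```python
def padded_ngrams(tokens, n):
    """Return all n-grams of `tokens` after adding '^' and '$' markers."""
    padded = ['^'] + tokens + ['$']
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

# Toy sentence, invented for illustration.
grams = padded_ngrams('please stay at home'.split(), 3)
for g in grams:
    print(g)
```

The first trigram starts with `^` and the last one ends with `$`, so a generator can tell where a sentence may begin and where it may stop.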

# Making the generator

Now let's create a function that returns the next token (word) given the previous tokens.

In [None]:
def get_next_token(previous_tokens: tuple, ngrams: dict, method: str = 'random'):
    """
    Returns the next token if one is found, otherwise None

    :param previous_tokens: previous tokens (N - 1 of them for ngrams of size N)
    :param ngrams: ngrams in which to search for the next token
    :param method: method of search.
        'random' - picks a random matching token.
        'weighted' - picks a random matching token; tokens that were found more often are more likely to be picked
        'most_common' - picks the most common matching token

    :returns: str or None
    """
    # an ngram matches if everything except its last token equals previous_tokens
    # (key[:-1] instead of key[:2], so that any ngram size works)
    matching_ngrams = {key: value for key, value in ngrams.items() if key[:-1] == previous_tokens}

    if matching_ngrams:

        if method == 'random':
            return random.choice(list(matching_ngrams.keys()))[-1]

        elif method == 'weighted':
            return random.choices(list(matching_ngrams.keys()), weights=list(matching_ngrams.values()))[0][-1]

        elif method == 'most_common':
            return sorted(matching_ngrams.items(), key=lambda x: x[1])[-1][0][-1]

### Test it

In [None]:
previous_tokens = ('in', 'the')

print('random:')
for _ in range(3):
    print('\t', get_next_token(previous_tokens, trigrams, method='random'))
    
print('weighted:')
for _ in range(3):
    print('\t', get_next_token(previous_tokens, trigrams, method='weighted'))

print('most_common:')
for _ in range(3):
    print('\t', get_next_token(previous_tokens, trigrams, method='most_common'))
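
To see how the three methods differ, here is a minimal sketch that samples many times from a set of made-up continuation counts. The words and counts are invented for illustration and do not come from the dataset.

```python
import random
from collections import Counter

# Made-up continuation counts for some fixed two-word context.
candidates = {'world': 90, 'uk': 9, 'meantime': 1}

random.seed(0)  # fixed seed so the run is repeatable
words = list(candidates)
weights = list(candidates.values())

# 'random': every candidate is equally likely, whatever its count
uniform = Counter(random.choice(words) for _ in range(3000))

# 'weighted': the chance of a candidate is proportional to its count
weighted = Counter(random.choices(words, weights=weights, k=3000))

# 'most_common': always the candidate with the highest count
most_common = max(candidates, key=candidates.get)

print('uniform :', dict(uniform))
print('weighted:', dict(weighted))
print('most_common:', most_common)
```

With uniform sampling the rare word shows up about as often as the frequent one, while weighted sampling keeps it rare, and `most_common` never produces it at all.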

## Starting the tweet
Now, let's create a function that starts the tweet. It searches for ngrams starting with the "^" symbol and returns one of them according to "method".

In [None]:
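def get_starter(ngrams, method='random'):
    """
    Returns an ngram that starts a sentence, i.e. one whose first token is '^'

    :param ngrams: ngrams in which to search for a starter
    :param method: 'random', 'weighted' or 'most_common' (same meaning as in get_next_token)

    :returns: tuple of tokens
    """
    starters = {key: value for key, value in ngrams.items() if key[0] == '^'}

    if method == 'random':
        return random.choice(list(starters.keys()))

    elif method == 'weighted':
        return random.choices(list(starters.keys()), weights=list(starters.values()))[0]

    elif method == 'most_common':
        return sorted(starters.items(), key=lambda x: x[1])[-1][0]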

### Test it

In [None]:
print('random:')
for _ in range(3):
    print('\t', get_starter(trigrams, method='random'))
    
print('weighted:')
for _ in range(3):
    print('\t', get_starter(trigrams, method='weighted'))

print('most_common:')
for _ in range(3):
    print('\t', get_starter(trigrams, method='most_common'))

## Creating a tweet
Now, let's write a function that creates a tweet. If the created tweet is shorter than "min_length" tokens or does not end with the end-of-sentence marker, the function regenerates it until both conditions are met.

In [None]:
def generate_tweet(ngrams, min_length: int = 10, starter_method='random', next_method='most_common'):

    # ngram size, taken from any key of the model
    N = len(next(iter(ngrams)))

    starter = get_starter(ngrams, starter_method)
    # drop the '^' marker; the remaining N - 1 tokens open the tweet
    tokens = list(starter)[1:]

    next_token = get_next_token(tuple(tokens), ngrams, next_method)
    while next_token:
        tokens.append(next_token)

        # the last N - 1 tokens are the context for the next lookup
        last_tokens = tuple(tokens[-N+1:])
        next_token = get_next_token(last_tokens, ngrams, next_method)

    # retry if the tweet is too short or did not reach the end marker '$'
    if len(tokens) < min_length or tokens[-1] != '$':
        tweet = generate_tweet(ngrams, min_length, starter_method, next_method)
    else:
        tweet = ' '.join(tokens)
    return tweet
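
As an alternative to retrying through recursion, here is a sketch of a loop-based variant with explicit limits. This is not the notebook's method: `toy_trigrams`, `next_token`, `generate`, `max_tokens` and `max_attempts` are hypothetical names, the counts are invented, and the sketch assumes ngrams of size 2 or more.

```python
import random

# Tiny hand-made trigram counts (invented for this sketch).
toy_trigrams = {
    ('^', 'stay', 'home'): 3,
    ('stay', 'home', 'and'): 2,
    ('home', 'and', 'stay'): 1,
    ('and', 'stay', 'home'): 1,
    ('home', 'and', 'rest'): 2,
    ('and', 'rest', '$'): 2,
}

def next_token(context, ngrams):
    """Pick a continuation of `context`, weighted by count, or None."""
    options = {k[-1]: v for k, v in ngrams.items() if k[:-1] == context}
    if not options:
        return None
    return random.choices(list(options), weights=list(options.values()))[0]

def generate(ngrams, min_length=4, max_tokens=30, max_attempts=100):
    """Retry in a loop until a sentence is long enough and ends with '$'."""
    n = len(next(iter(ngrams)))
    starters = [k for k in ngrams if k[0] == '^']
    for _ in range(max_attempts):
        tokens = list(random.choice(starters))[1:]
        # max_tokens caps a single attempt, so a cycle cannot run forever
        while len(tokens) < max_tokens:
            token = next_token(tuple(tokens[-(n - 1):]), ngrams)
            if token is None:
                break
            tokens.append(token)
            if token == '$':
                break
        if tokens[-1] == '$' and len(tokens) >= min_length:
            return ' '.join(tokens[:-1])  # drop the end marker
    return None  # give up instead of retrying indefinitely

random.seed(1)
result = generate(toy_trigrams)
print(result)
```

The two caps are a design choice: `max_tokens` bounds a single attempt if the model contains a cycle, and `max_attempts` bounds the number of retries, so the function always returns.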

# Testing the generator

Here are some generated tweets:

In [None]:
random.seed(43)
tweets = [generate_tweet(trigrams) for _ in range(10)]
for tweet in tweets:
    print('*', tweet)

Let's check whether the generated tweets appear in the original tweets:

In [None]:
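for tweet in tweets:
    print(tweet)
    # drop the trailing '$' marker and surrounding whitespace before searching
    needle = tweet.rstrip('$').strip()
    found = False
    for text in df['text']:
        # compare in lowercase, because generated tweets are lowercased
        if needle in text.lower():
            print('\t', text.replace('\n', ' ').replace('\t', ''))
            found = True

    # a for/else branch would run after every loop without `break`, so use a flag
    if not found:
        print('\t--No matches')
    print()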

# Conclusion
We have created a text generator based on ngram models and trained it on COVID tweets.
The result is not very impressive: the generated tweets lack sense. I think the dataset is too small for this task and an ngram model is not the right tool for it.
But we had fun :)