# Textual Pattern Analysis with Unigram and Bigram Models

This project performs a basic language-processing task on a given piece of text: identifying and analyzing its unigrams and bigrams. It also estimates how likely a randomly generated string of words is under the computed bigram model.

The text selected for this project is:

`One use of sentence embeddings is information retrieval. Consider the task of searching the Snap! manual or this AI programming guide. String matching cannot take into account synonyms, different ways of saying the same thing, or different spelling conventions. In this sample search project sentence embeddings are used to compare the user's query with sentence fragments from the manual and guide. By relying on the features closest to the list of features block the closest fragments are found very quickly. The embeddings of all the fragments have been pre-computed so only the embedding of the user's query is needed.`

## Text Processing
First, we preprocess the text by tokenizing it, that is, splitting it into individual words and punctuation marks with NLTK's `word_tokenize`.

In [3]:
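import nltk
from nltk import word_tokenize

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, download it once:
# nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead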

# Our input paragraph
input_paragraph = "One use of sentence embeddings is information retrieval. Consider the task of searching the Snap! manual or this AI programming guide. String matching cannot take into account synonyms, different ways of saying the same thing, or different spelling conventions. In this sample search project sentence embeddings are used to compare the user's query with sentence fragments from the manual and guide. By relying on the features closest to the list of features block the closest fragments are found very quickly. The embeddings of all the fragments have been pre-computed so only the embedding of the user's query is needed."

# Tokenize the paragraph into words
tokens = word_tokenize(input_paragraph)
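
To make the tokenization step concrete, here is a rough regex-based sketch of what a tokenizer does. It is only an approximation and not NLTK's algorithm: `simple_tokenize` is a hypothetical helper, and it differs from `word_tokenize` in some cases (for example, NLTK splits "cannot" into "can" and "not", while this sketch keeps it whole).

```python
import re

def simple_tokenize(text):
    """Split text into words, clitics such as 's, and punctuation marks.

    Hypothetical helper for illustration; not equivalent to nltk.word_tokenize.
    """
    # 1) words, optionally hyphenated; 2) an apostrophe clitic; 3) any other symbol
    pattern = r"\w+(?:-\w+)*|'\w+|[^\w\s]"
    return re.findall(pattern, text)

print(simple_tokenize("the user's query is pre-computed."))
```

Punctuation marks become tokens of their own, which is why they also show up as unigrams and inside bigrams later on.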

## Unigram and Bigram Calculation
Next, we calculate the unigrams and bigrams. Unigrams are the individual tokens of the text, and bigrams are pairs of adjacent tokens.

In [4]:
from nltk.util import ngrams

# Calculate unigrams
unigrams = list(ngrams(tokens, 1))

# Calculate bigrams
bigrams = list(ngrams(tokens, 2))
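
The n-gram extraction above is a sliding window over the token list. The following is a minimal pure-Python sketch of that idea, assuming a short hand-written token list; `extract_ngrams` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

def extract_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A short hand-written token list used only for illustration
sample = ["the", "user", "'s", "query", "and", "the", "user"]

sample_bigrams = extract_ngrams(sample, 2)
bigram_counts = Counter(sample_bigrams)  # how often each bigram occurs

print(sample_bigrams[:3])
print(bigram_counts[("the", "user")])
```

A list of `k` tokens yields `k - n + 1` n-grams, so a list shorter than `n` yields none.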


**Displaying the unigrams and bigrams as tables**

In [75]:
import pandas as pd

# Convert unigrams and bigrams to dataframes
unigram_df = pd.DataFrame(unigrams, columns=['word'])
bigram_df = pd.DataFrame(bigrams, columns=['word1', 'word2'])

# Disable pandas display truncation, then print both dataframes
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
print(unigram_df)
print(bigram_df)

             word
0             One
1             use
2              of
3        sentence
4      embeddings
5              is
6     information
7       retrieval
8               .
9        Consider
10            the
11           task
12             of
13      searching
14            the
15           Snap
16              !
17         manual
18             or
19           this
20             AI
21    programming
22          guide
23              .
24         String
25       matching
26            can
27            not
28           take
29           into
30        account
31       synonyms
32              ,
33      different
34           ways
35             of
36         saying
37            the
38           same
39          thing
40              ,
41             or
42      different
43       spelling
44    conventions
45              .
46             In
47           this
48         sample
49         search
50        project
51       sentence
52     embeddings
53            are
[output truncated]

## Random String Generation and Probability Calculation
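Now we generate a random string of fewer than five words by sampling tokens from the text, then score it with the bigram counts. The score is the product, over the string's bigrams, of each bigram's count divided by the number of distinct bigrams in the text, so it is a relative-frequency score rather than a normalized probability. Samples that score 0 (some bigram never occurs in the text) or 1 (a one-word string has no bigrams) are discarded and redrawn. In the run recorded below, the bigram `('fragments', 'are')` occurs once among 105 distinct bigrams, which gives 1/105 ≈ 0.0095.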

In [72]:
import random

# Count how often each bigram occurs in the text (computed once, outside the loop)
bigram_freq = nltk.FreqDist(nltk.bigrams(tokens))

random_string_prob = 0
# Redraw until the score is informative: 0 means an unseen bigram, 1 means a one-word string
while random_string_prob in (0, 1):
    # Sample 1 to 4 tokens from the text, without replacement
    random_string_length = random.randint(1, 4)
    random_string = random.sample(tokens, random_string_length)
    # Multiply count / (number of distinct bigrams) over the string's bigrams
    random_string_prob = 1
    for bigram in ngrams(random_string, 2):
        random_string_prob *= bigram_freq[bigram] / len(bigram_freq)

print("Random string:", random_string)
print("Probability:", random_string_prob)

Random string: ['fragments', 'are']
Probability: 0.009523809523809525
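
For comparison, a more conventional bigram-model score is the maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1), multiplied along the string. The sketch below is an alternative to the calculation above, not the method this project uses. It assumes no smoothing and a toy token list, and `bigram_mle_probability` is a hypothetical helper.

```python
from collections import Counter

def bigram_mle_probability(words, tokens):
    """Probability of words[1:] given words[0], using unsmoothed MLE bigram estimates.

    Hypothetical helper for illustration; unseen bigrams give probability 0.
    """
    context_counts = Counter(tokens[:-1])             # tokens that start a bigram
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs
    probability = 1.0
    for first, second in zip(words, words[1:]):
        if context_counts[first] == 0:
            return 0.0  # the context word never starts a bigram in the text
        probability *= bigram_counts[(first, second)] / context_counts[first]
    return probability

# Toy token list used only for illustration
toy_tokens = "the closest fragments are found and the fragments are used".split()

print(bigram_mle_probability(["fragments", "are"], toy_tokens))         # 1.0
print(bigram_mle_probability(["the", "fragments", "are"], toy_tokens))  # 0.5
```

Under this estimate, the conditional probabilities for a given first word sum to 1 over all possible next words, which the count-over-distinct-bigrams score does not guarantee.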


## Description
**You can find the full details in `report.pdf`**

This project uses basic Natural Language Processing (NLP) techniques to analyze a piece of text. It relies on Python's `nltk` library for tokenization, n-gram extraction, and frequency counting. The project could be extended with more complex analysis of the text, such as calculating trigrams or applying machine learning techniques.