
**NLP homework 2:**

1.) Use the corpus you created in homework 1

2.) Apply tokenization:

a. White space tokenizer

b. Regex tokenizer

c. Word tokenizer

d. Sentence tokenizer

3.) Apply Normalization:

a. Stemming

b. Lemmatization

4.) Remove Stop Words: conjunctions and articles

5.) Apply feature extraction by the following algorithms:

a. BOW

b. TF-IDF

c. Word embedding with Word2Vec

6.) What is GloVe? (Use ChatGPT). Can you apply it to your data? Explain the results.

7.) Select 5 sentences and parse them with CYK


## *Yarden Mizrahi - 209521293*

### **Imports**

In [None]:
import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec
from nltk.corpus import stopwords, words
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize, RegexpTokenizer, WhitespaceTokenizer

import pandas as pd
from google.colab import drive

### **Loading The Data**

In [None]:
from google.colab import drive
drive.mount('/content/drive')
path = "/content/drive/MyDrive/Data/spam.csv"

Mounted at /content/drive


In [None]:
# Download nltk resources if not already present
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
df = pd.read_csv(path, encoding='latin-1')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


In [None]:
import string

# Load the small English spaCy pipeline
nlp = spacy.load("en_core_web_sm")

# Basic preprocessing resources
# (stopwords was already imported in the Imports cell; the lemmatizer is created in Relevant Functions)
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

### **Relevant Functions**

In [None]:
# Tokenization
def whitespace_tokenizer(text):
    tokenizer = WhitespaceTokenizer()
    return tokenizer.tokenize(text)

def regex_tokenizer(text):
    tokenizer = RegexpTokenizer(r'\w+')
    return tokenizer.tokenize(text)

def word_tokenizer(text):
    return word_tokenize(text)

def sentence_tokenizer(text):
    return sent_tokenize(text)

# Normalization
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def stemming(tokens):
    return [stemmer.stem(token) for token in tokens]

def lemmatization(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]

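# Remove stop words
# The NLTK stop-word list is lowercase, so compare the lowercased token;
# otherwise capitalised tokens such as "The" would slip through.
def remove_stopwords(tokens):
    return [token for token in tokens if token.lower() not in stop_words]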
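
The difference between the whitespace and regex tokenizers is easiest to see on a single message. The sketch below uses only the standard library and assumes that `str.split()` and `re.findall(r"\w+", ...)` behave like `WhitespaceTokenizer` and `RegexpTokenizer(r'\w+')` on ordinary text. The sample message is made up for illustration and is not taken from the corpus.

```python
import re

# Hypothetical SMS-style message, not from spam.csv.
sample = "Free entry! Text WIN to 80082, it's only 10p."

# Whitespace tokenization: split on runs of whitespace, so punctuation
# stays attached to the neighbouring word.
whitespace_tokens = sample.split()

# Regex tokenization with \w+: keep only runs of letters, digits and
# underscores, so punctuation is dropped and "it's" is split in two.
regex_tokens = re.findall(r"\w+", sample)

print(whitespace_tokens)
print(regex_tokens)
```

The regex variant yields cleaner vocabulary entries but loses contractions and punctuation cues, which is one reason the rest of the pipeline works from `word_tokenize` output instead.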

In [None]:
# Process the corpus
corpus = df['v2'].tolist()
# Initialize the processed corpus list
final_processed_corpus = []

# Define the processing function
def process_text(text):
    # Step 1: Tokenization
    whitespace_tokens = whitespace_tokenizer(text)
    regex_tokens = regex_tokenizer(text)
    word_tokens = word_tokenizer(text)
    sentence_tokens = sentence_tokenizer(text)

    # Step 2: Normalization - Stemming and Lemmatization
    stemmed_tokens = stemming(word_tokens)
    lemmatized_tokens = lemmatization(word_tokens)

    # Step 3: Stop Words Removal
    stopword_removed_tokens = remove_stopwords(word_tokens)

    # Combine all transformations into a single representation
    combined_tokens = {
        'whitespace_tokens': whitespace_tokens,
        'regex_tokens': regex_tokens,
        'word_tokens': word_tokens,
        'sentence_tokens': sentence_tokens,
        'stemmed_tokens': stemmed_tokens,
        'lemmatized_tokens': lemmatized_tokens,
        'stopword_removed_tokens': stopword_removed_tokens
    }

    return combined_tokens

# Process each text in the corpus
for text in corpus:
    processed_text = process_text(text)
    final_processed_corpus.append(processed_text)
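
Step 3 filters against the full NLTK English stop-word list, while the assignment asks only for conjunctions and articles. A narrower filter could look like the sketch below. The word lists and the function name `remove_articles_and_conjunctions` are hypothetical and written by hand for illustration; they do not come from NLTK or from the notebook.

```python
# Hypothetical, hand-written lists (assumption: coordinating conjunctions only).
ARTICLES = {"a", "an", "the"}
CONJUNCTIONS = {"and", "but", "or", "nor", "for", "so", "yet"}
NARROW_STOP_WORDS = ARTICLES | CONJUNCTIONS

def remove_articles_and_conjunctions(tokens):
    """Drop tokens that are articles or coordinating conjunctions, ignoring case."""
    return [t for t in tokens if t.lower() not in NARROW_STOP_WORDS]

example = ["The", "dog", "and", "a", "cat", "ran", "but", "stopped"]
filtered = remove_articles_and_conjunctions(example)
print(filtered)
```

Words such as "for" and "so" are ambiguous (they can also act as a preposition or an adverb), so a list-based filter like this one removes them regardless of their role in the sentence.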


### **Feature Extraction**

In [None]:
# Feature Extraction
def bow_extraction(corpus):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(corpus)

def tfidf_extraction(corpus):
    vectorizer = TfidfVectorizer()
    return vectorizer.fit_transform(corpus)

def word2vec_extraction(tokens):
    model = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4)
    return model

In [None]:
# Transform processed corpus back into text for BoW and TF-IDF
corpus_text = [' '.join(tokens['word_tokens']) for tokens in final_processed_corpus]

# Apply BoW and TF-IDF
bow_features = bow_extraction(corpus_text)
tfidf_features = tfidf_extraction(corpus_text)

# Combine all tokens from the processed corpus for word2vec training
all_tokens = [tokens['word_tokens'] for tokens in final_processed_corpus]
word2vec_model = word2vec_extraction(all_tokens)

In [None]:
bow_features

<5572x8660 sparse matrix of type '<class 'numpy.int64'>'
	with 74014 stored elements in Compressed Sparse Row format>

In [None]:
tfidf_features

<5572x8660 sparse matrix of type '<class 'numpy.float64'>'
	with 74014 stored elements in Compressed Sparse Row format>
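
The two matrices have the same shape and the same number of stored elements; they differ only in the values they hold. BoW stores raw counts, while TF-IDF reweights each count by how rare the term is across documents. The sketch below reproduces both on a three-document toy corpus in plain Python. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 row normalisation, which is what `TfidfVectorizer` uses with its default settings; the toy documents are made up.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and whitespace-separated.
docs = ["free prize call now", "call me now", "see you later"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

# Bag of words: one row of raw term counts per document.
bow = []
for doc in tokenized:
    counts = Counter(doc)
    bow.append([counts[w] for w in vocab])

# Smoothed inverse document frequency for every vocabulary term.
n_docs = len(docs)
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(count_row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weighted = [c * idf[w] for c, w in zip(count_row, vocab)]
    norm = math.sqrt(sum(x * x for x in weighted))
    return [x / norm for x in weighted]

tfidf = [tfidf_row(row) for row in bow]

print(vocab)
print(bow[0])
print([round(x, 3) for x in tfidf[0]])
```

In the first document, "free" appears in one document only and "call" in two, so "free" ends up with the larger TF-IDF weight even though both have a raw count of 1.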

In [None]:
word2vec_model

<gensim.models.word2vec.Word2Vec at 0x795db4505d50>

### **What is GloVe?**
GloVe (Global Vectors for Word Representation) is an unsupervised learning algorithm for obtaining vector representations of words. It is trained on aggregated global word-word co-occurrence statistics from a corpus, and the resulting vectors exhibit linear substructures in the word vector space: for example, the offset between the vectors for "man" and "woman" is close to the offset between "king" and "queen".
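
To make "global word-word co-occurrence statistics" concrete, the sketch below counts how often two words appear within a fixed window of each other across a toy corpus. It assumes a symmetric window and a `1/distance` weighting, as described in the GloVe paper. The function name `cooccurrence_counts` and the toy sentences are hypothetical, and the actual GloVe training step (fitting vectors to these counts) is not shown.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Accumulate symmetric, distance-weighted word-word co-occurrence counts."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Look only to the right and record both directions,
            # which keeps the table symmetric.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                weight = 1.0 / (j - i)  # closer pairs contribute more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts

# Hypothetical tokenized messages.
toy = [["free", "prize", "call", "now"], ["call", "me", "now"]]
counts = cooccurrence_counts(toy, window=2)
print(counts[("call", "now")])
```

Here "call" and "now" are adjacent in the first message (weight 1) and two positions apart in the second (weight 0.5), while "free" and "now" are three positions apart and therefore never counted with `window=2`.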

In [None]:
import gensim.downloader as api

# Load pre-trained GloVe model
glove_model = api.load('glove-wiki-gigaword-100')

def get_glove_vectors(tokens):
    return [glove_model[token] for token in tokens if token in glove_model]

glove_vectors = get_glove_vectors(final_processed_corpus[0]['word_tokens'])



In [None]:
glove_vectors

[array([-0.11003  , -0.1945   , -0.0746   ,  0.13326  , -0.44721  ,
         0.14976  ,  0.58139  ,  0.53515  ,  0.41394  , -0.61779  ,
         0.93259  ,  0.1632   ,  0.56684  ,  0.37601  ,  0.3756   ,
        -0.74354  ,  0.32886  ,  0.18336  , -0.48717  , -0.52609  ,
        -0.43348  , -0.37205  ,  0.3386   , -0.22787  ,  0.06858  ,
         0.18793  , -0.27675  , -0.86805  ,  0.60734  ,  0.033616 ,
        -0.18512  ,  0.63641  , -0.29113  , -0.47396  , -0.83006  ,
         0.39132  , -0.33739  , -0.47392  , -0.45855  , -0.88229  ,
        -0.52698  , -0.73464  ,  0.24544  , -0.2111   ,  0.27155  ,
        -0.60551  ,  0.32774  , -0.84155  ,  0.4643   , -0.86121  ,
        -0.20803  , -0.25611  ,  0.12957  ,  1.5298   , -0.53295  ,
        -2.6471   , -0.72793  , -0.56359  ,  1.852    ,  0.49079  ,
        -0.2382   ,  1.0728   , -0.42418  ,  0.018965 ,  1.3196   ,
        -0.5115   , -0.082289 ,  0.1606   ,  0.041956 ,  0.30951  ,
         0.01204  , -0.29183  , -0.12679  , -0.1

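**The results:**
Yes, GloVe can be applied to this data by looking up each token in the pre-trained model. `glove_vectors` is a list with one 100-dimensional array per token of the first message that exists in the GloVe vocabulary; tokens that are missing from the vocabulary are skipped by `get_glove_vectors`. This pre-trained vocabulary appears to be lowercased, so tokens containing capital letters are probably skipped unless they are lowercased first. The individual numbers have no direct interpretation on their own: the meaning lies in the distances and directions between vectors, so words used in similar contexts end up close together.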

### **CYK**


In [None]:
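from nltk import CFG
# Parsing with NLTK's chart parser, which stands in for CYK here.
# The toy grammar is in Chomsky normal form, as CYK requires.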
grammar = CFG.fromstring("""
  S -> NP VP
  NP -> DT NN | DT NNS
  VP -> VBZ NP | VBD NP
  DT -> 'the' | 'a'
  NN -> 'dog' | 'cat'
  NNS -> 'dogs' | 'cats'
  VBZ -> 'chases'
  VBD -> 'chased'
""")

sentences = [
    "the dog chases the cat",
    "a cat chased the dogs",
    "the dogs chased a cat",
    "the dog chased a cat",
    "a dog chases the cats"
]

def cyk_parsing(sentence, grammar):
    tokens = sentence.split()
    parser = nltk.ChartParser(grammar)
    trees = list(parser.parse(tokens))
    return trees

for sentence in sentences:
    parse_trees = cyk_parsing(sentence, grammar)
    for tree in parse_trees:
        tree.pretty_print()

              S               
      ________|_____           
     |              VP        
     |         _____|___       
     NP       |         NP    
  ___|___     |      ___|___   
 DT      NN  VBZ    DT      NN
 |       |    |     |       |  
the     dog chases the     cat

              S                
      ________|_____            
     |              VP         
     |         _____|___        
     NP       |         NP     
  ___|___     |      ___|___    
 DT      NN  VBD    DT     NNS 
 |       |    |     |       |   
 a      cat chased the     dogs

               S               
      _________|_____           
     |               VP        
     |          _____|___       
     NP        |         NP    
  ___|___      |      ___|___   
 DT     NNS   VBD    DT      NN
 |       |     |     |       |  
the     dogs chased  a      cat

              S               
      ________|_____           
     |              VP        
     |         _____|___       
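
The parsing cell relies on `nltk.ChartParser`, which is a general chart parser rather than the CYK algorithm itself. For comparison, here is a minimal sketch of a CYK recognizer for the same toy grammar in plain Python. The rule tables are a hand transcription of the grammar, and the names `BINARY_RULES`, `LEXICAL_RULES` and `cyk_recognize` are hypothetical. It only answers whether a sentence is derivable from `S`; it does not build trees.

```python
from itertools import product

# The notebook's toy grammar, transcribed by hand (already in Chomsky normal form).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("DT", "NN"): {"NP"},
    ("DT", "NNS"): {"NP"},
    ("VBZ", "NP"): {"VP"},
    ("VBD", "NP"): {"VP"},
}
LEXICAL_RULES = {
    "the": {"DT"}, "a": {"DT"},
    "dog": {"NN"}, "cat": {"NN"},
    "dogs": {"NNS"}, "cats": {"NNS"},
    "chases": {"VBZ"}, "chased": {"VBD"},
}

def cyk_recognize(tokens, start="S"):
    """Return True if the token list can be derived from the start symbol."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i] = set(LEXICAL_RULES.get(tok, ()))
    # Fill longer spans from shorter ones, trying every split point k.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]

print(cyk_recognize("the dog chases the cat".split()))
print(cyk_recognize("dog the chases cat the".split()))
```

The table is filled bottom-up in O(n³) split attempts for a sentence of n tokens. Storing back-pointers alongside each non-terminal would be the natural extension for recovering parse trees.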
    

