### Chunking

In [1]:
import nltk
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser

# Sample sentence
sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize and POS-tag the sentence
tokens = word_tokenize(sentence)
tags = pos_tag(tokens)
tags

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumps', 'VBZ'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('dog', 'NN'),
 ('.', '.')]

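#### Notes on the regular expressions:
- `<>` means that what is inside is a part-of-speech tag and not a literal string.
- `?` means optional: `<DT>?` matches an optional determiner.
- `*` means zero or more occurrences: `<JJ>*` matches zero or more adjectives.
- `.*` inside a tag matches any suffix: `<VB.*>` matches every verb tag (`VB`, `VBD`, `VBZ`, and so on).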


In [4]:
# Define a chunking pattern using regular expressions
# NP finds noun phrases such as 'the lazy dog'; VP matches any single verb
grammar = """
  NP: {<DT>?<JJ>*<NN>}   # Optional determiner (DT), any adjectives (JJ), one noun (NN)
  VP: {<VB.*>}            # Verb phrase (VP), any verb tag
"""

# Create a chunk parser using the grammar
chunk_parser = RegexpParser(grammar)

# Apply the chunking pattern to the POS-tagged sentence
tree = chunk_parser.parse(tags)

# Show the chunked sentence
tree.pretty_print()


                               S                                           
    ___________________________|________________________________            
   |     |            NP               NP       VP              NP         
   |     |     _______|________        |        |        _______|______     
over/IN ./. The/DT quick/JJ brown/NN fox/NN jumps/VBZ the/DT lazy/JJ dog/NN



Note that `brown` is tagged as a noun (`NN`) even though it is an adjective here, so the parser splits the phrase into `The quick brown` and `fox`. POS taggers make mistakes, and any chunker built on their tags inherits those errors. Overall, though, chunking extracts patterns from the text that are very useful for information extraction and for named-entity recognition (NER).
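To use the chunks downstream, the `NP` subtrees can be pulled out of the parse tree. The following is a minimal sketch, assuming the tags are hard-coded from the tagger output shown earlier so that no tagger model is needed; the variable names `tagged` and `noun_phrases` are illustrative only.

```python
from nltk.chunk import RegexpParser

# Tags copied from the tagger output shown earlier (hard-coded for illustration)
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "NN"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

# Same NP rule as in the cell above
parser = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Collect the words of every subtree labelled NP
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

With these tags the list should contain `'The quick brown'`, `'fox'` and `'the lazy dog'`, which mirrors the three `NP` nodes in the printed tree.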

### Chinking 
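Once the chunking is finished, you might want to exclude some tags, such as prepositions or adjectives, from a chunk. This is done by chinking: in NLTK a chink rule is written `}...{` and removes the matching tags from an existing chunk.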

In [10]:
import nltk
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser

# Sample sentence
sentence = "The quick brown fox jumped over the lazy dog."

# Tokenize and POS-tag the sentence
tokens = word_tokenize(sentence)
tags = pos_tag(tokens)

# Chunk greedily first, then chink (remove) the verb and the preposition.
# A chunk rule is written {...}; a chink rule is written }...{ and cuts
# the matching tags out of a chunk that already exists.
grammar = """
  NP:
    {<DT|JJ|NN|VBD|IN>+}   # Chunk every run of determiners, adjectives, nouns, verbs and prepositions
    }<VBD|IN>+{            # Chink, i.e. remove past-tense verbs (VBD) and prepositions (IN)
"""

# Create a chunk parser using the grammar
chunk_parser = RegexpParser(grammar)

# Apply the chunking and chinking rules to the POS-tagged sentence
tree = chunk_parser.parse(tags)

# Show the chunked sentence
tree.pretty_print()


                                 S                                          
     ____________________________|_______________________________            
    |         |     |            NP                              NP         
    |         |     |     _______|________________        _______|______     
jumped/VBD over/IN ./. The/DT quick/JJ brown/NN fox/NN the/DT lazy/JJ dog/NN



In [14]:
import nltk
from nltk import pos_tag, word_tokenize
from nltk.chunk import RegexpParser

# Sample sentence
sentence = "The quick brown fox jumped over the lazy dog."

# Tokenize and POS-tag the sentence
tokens = word_tokenize(sentence)
tags = pos_tag(tokens)

# Variant: a three-stage grammar that adds a verb-phrase (VP) stage.
# Caution: `-<IN>` is not NLTK chink syntax (chinks are written }...{).
# The `-` appears to be read as a literal character in the tag pattern,
# so the VP stage matches nothing and no VP node shows up in the tree.
grammar = """
  NP: {<DT>?<JJ>*<NN>*}   # Noun phrase with optional determiner, adjectives, and nouns
  VP: {<VB.*><IN> -<IN>}  # Matches nothing because of the literal '-'
  NP: {<DT>?<JJ>*<NN>}    # Second NP pass over the tokens that are still unchunked
"""

# Create a chunk parser using the grammar
chunk_parser = RegexpParser(grammar)

# Apply the chunking and chinking rules to the POS-tagged sentence
tree = chunk_parser.parse(tags)

# Show the chunked sentence
tree.pretty_print()


                                 S                                          
     ____________________________|_______________________________            
    |         |     |            NP                              NP         
    |         |     |     _______|________________        _______|______     
jumped/VBD over/IN ./. The/DT quick/JJ brown/NN fox/NN the/DT lazy/JJ dog/NN



### First coding practice: 

In [1]:
import nltk 
import random 
import re, unicodedata
import string 
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
import wikipedia  # provides wikipedia.summary(), which wikipedia_data() calls
from collections import defaultdict
import warnings 
warnings.filterwarnings('ignore')
nltk.download('wordnet')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel 

[nltk_data] Downloading package wordnet to /Users/apple/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
with open('HR.txt', 'r') as file : 
    raw = file.read()

In [3]:
# Convert to lower case
raw = raw.lower()
raw[:2000]

'human resource analytics is at the intersection of three bodies of knowledge:\n\nhuman resource management: sets the meaning and purpose of the analytics. \n\ndata warehousing: knowing how to process and store hr data efficiently, automation of  collection of data and cleaning data.\n\nstatistical analysis, presentation and interpretation : helps in translating the identified hr  issues into appropriate analyses and communication of results.\n\n\n\n5 fundamental principles of analytics\n\nhr analytics is about metrics and measurement. good metrics definitions, both narrative and formulaic, and their documentation are key.\n\na professional and good hr analytics person will have the above bodies of knowledge and know their process and intersection.\n\ngood communication and collaborative skills are essential. the in-depth expertise in your organization is likely to exist in hrm. it. and decision support. you will need to collaborate with these groups.\n\nthe extent of hr analytics can 

In [4]:
# Tokenize the data into sentences 
sent_tokens = nltk.sent_tokenize(raw)
sent_tokens[0:6]

['human resource analytics is at the intersection of three bodies of knowledge:\n\nhuman resource management: sets the meaning and purpose of the analytics.',
 'data warehousing: knowing how to process and store hr data efficiently, automation of  collection of data and cleaning data.',
 'statistical analysis, presentation and interpretation : helps in translating the identified hr  issues into appropriate analyses and communication of results.',
 '5 fundamental principles of analytics\n\nhr analytics is about metrics and measurement.',
 'good metrics definitions, both narrative and formulaic, and their documentation are key.',
 'a professional and good hr analytics person will have the above bodies of knowledge and know their process and intersection.']

The dictionary `remove_punk_dict` maps the code point of every punctuation character to `None`. When `translate` is called, each character of the text is looked up in `remove_punk_dict`; any character that matches a key is mapped to `None`, i.e. it is removed.

The key feature of defaultdict is that it allows you to specify a default value that will be returned when you try to access a key that doesn't exist in the dictionary.
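Both ideas can be tried in isolation. This is a small sketch, not the chatbot code itself: the sample string is made up, and the one-letter strings stand in for `wn.NOUN`, `wn.ADJ`, `wn.VERB` and `wn.ADV` so that WordNet does not have to be loaded.

```python
import string
from collections import defaultdict

# Keys are Unicode code points (integers); a value of None tells translate() to delete the character
remove_punct = {ord(ch): None for ch in string.punctuation}
cleaned = "Hello, world! (HR-analytics)".translate(remove_punct)
print(cleaned)  # Hello world HRanalytics

# Stand-ins for the WordNet POS constants (assumed here to avoid importing WordNet)
pos_map = defaultdict(lambda: "n")  # any unknown key falls back to noun
pos_map["J"] = "a"  # adjective
pos_map["V"] = "v"  # verb
pos_map["R"] = "r"  # adverb

adjective_pos = pos_map["JJ"[0]]   # first letter 'J' is a known key
fallback_pos = pos_map["DT"[0]]    # first letter 'D' is missing, so the default is used
print(adjective_pos, fallback_pos)  # a n
```

Note that looking up a missing key also stores it in the `defaultdict`, which is harmless here but worth knowing.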

### What Unicode Normalization Does

Unicode normalization is a process that changes a string to a standardized form, which is helpful when dealing with characters that may have multiple representations. For example, some characters can be represented as a single "composite" character or as multiple "decomposed" characters.

### NFKD Normalization

The `"NFKD"` normalization form stands for **Normalization Form KD (Compatibility Decomposition)**. It breaks characters down into their basic parts. Specifically:

1. **Decomposition**: It splits characters with diacritics (like accents or tildes) into separate components. For instance:
   - `é` becomes `e` + `´` (an 'e' followed by a separate combining acute accent, U+0301).
   - `ñ` becomes `n` + `~` (an 'n' followed by a separate combining tilde, U+0303).
   
2. **Compatibility Decomposition (KD)**: It also converts characters to simpler, compatibility-equivalent forms. For example:
   - The symbol `ℌ` (black-letter capital H) will normalize to the standard `H`.
   - The ligature `ﬂ` will normalize to `f` + `l`.

After normalizing with `NFKD`, if you then encode to ASCII (as the code does), any non-ASCII components (like diacritics) are removed, leaving just the ASCII characters.

### Example

Here’s what happens when you apply `unicodedata.normalize('NFKD', word)` on different characters:

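- `"café"` → `"cafe"` + combining acute accent
- `"résumé"` → `"re"` + accent + `"sume"` + accent (each accent directly follows its `e`)
- `"ℌ"` → `"H"`
- `"½"` (the fraction one-half) → `"1"` + `⁄` (fraction slash, U+2044) + `"2"`, so the ASCII step leaves `"12"`

In short, `NFKD` normalization decomposes characters into base letters plus separate marks. It is the ASCII encoding applied afterwards that drops the accents and other special markings, which is helpful for creating ASCII-compatible strings.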

Then we encode each word to ASCII, ignoring any character that has no ASCII equivalent (such as the combining accents).
Finally we decode the bytes back to a string with 'utf-8', again ignoring anything that cannot be decoded.
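The two steps can be checked separately. The sketch below uses a hypothetical helper name, `strip_accents`, and shows that normalization alone keeps the accent as an extra code point while the ASCII round trip removes it.

```python
import unicodedata

def strip_accents(text):
    """Decompose with NFKD, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# NFKD alone does not delete the accent; it splits 'é' into 'e' + U+0301
decomposed_e = unicodedata.normalize("NFKD", "é")
print([hex(ord(ch)) for ch in decomposed_e])  # ['0x65', '0x301']

# The ASCII encode/decode step is what removes the non-ASCII parts
print(strip_accents("café"))    # cafe
print(strip_accents("résumé"))  # resume
print(strip_accents("ﬂ"))       # fl
print(strip_accents("½"))       # 12 (the fraction slash U+2044 is dropped)
```

The last line is a reminder that the ASCII step can change meaning, so it suits search-style matching better than display text.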

In [5]:
def Normalize(x):
    # Map every punctuation code point to None so that translate() deletes it
    remove_punk_dict = dict((ord(punct), None) for punct in string.punctuation)
    word_token = nltk.word_tokenize(x.lower().translate(remove_punk_dict))

    # Remove non-ASCII characters, e.g.
    # ["café", "résumé", "naïve", "façade", "jalapeño", "niño"] -> ["cafe", "resume", "naive", "facade", "jalapeno", "nino"]
    new_words = []
    for word in word_token:
        new_word = unicodedata.normalize('NFKD', word).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        new_words.append(new_word)

    # Collapse HTML tags to "<>"
    rmv = []
    for w in new_words:
        text = re.sub("</?.*?>", "<>", w)
        rmv.append(text)

    # Map the first letter of a Penn Treebank tag to a WordNet POS (default: noun)
    pos_map = defaultdict(lambda: wn.NOUN)
    pos_map['J'] = wn.ADJ
    pos_map['V'] = wn.VERB
    pos_map['R'] = wn.ADV
    # Define the word lemmatizer
    lmtzr = WordNetLemmatizer()
    lemma_list = []
    rmv = [i for i in rmv if i]  # drop empty strings left over from the cleaning steps
    for token, tag in nltk.pos_tag(rmv):
        lemma = lmtzr.lemmatize(token, pos_map[tag[0]])
        lemma_list.append(lemma)
    return lemma_list

In [21]:
pos_map = defaultdict(lambda: wn.NOUN)
pos_map['J'] = wn.ADJ 
pos_map['V'] = wn.VERB
pos_map['R'] = wn.ADV

In [17]:
res = nltk.word_tokenize(sent_tokens[0])

In [24]:
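# res_tag[0][0] is the token 'human', not its tag, so the default (noun) comes back;
# use res_tag[0][1][0] to look up the first letter of the tag instead
pos_map[res_tag[0][0]]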


'n'

In [25]:
res_tag = nltk.pos_tag(res)
res_tag

[('human', 'JJ'),
 ('resource', 'NN'),
 ('analytics', 'NNS'),
 ('is', 'VBZ'),
 ('at', 'IN'),
 ('the', 'DT'),
 ('intersection', 'NN'),
 ('of', 'IN'),
 ('three', 'CD'),
 ('bodies', 'NNS'),
 ('of', 'IN'),
 ('knowledge', 'NN'),
 (':', ':'),
 ('human', 'JJ'),
 ('resource', 'NN'),
 ('management', 'NN'),
 (':', ':'),
 ('sets', 'VBZ'),
 ('the', 'DT'),
 ('meaning', 'NN'),
 ('and', 'CC'),
 ('purpose', 'NN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('analytics', 'NNS'),
 ('.', '.')]

In [13]:
sent_tokens[0]

'human resource analytics is at the intersection of three bodies of knowledge:\n\nhuman resource management: sets the meaning and purpose of the analytics.'

In [11]:
Normalize(sent_tokens[0])

['human',
 'resource',
 'analytics',
 'be',
 'at',
 'the',
 'intersection',
 'of',
 'three',
 'body',
 'of',
 'knowledge',
 'human',
 'resource',
 'management',
 'set',
 'the',
 'meaning',
 'and',
 'purpose',
 'of',
 'the',
 'analytics']

In [6]:
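# Matching is done word by word, so the two-word entry "what's up" can never match as written
welcome_input = ("hello", "hi", "greetings", "sup", "what's up", "hey")
welcome_response = ["hi", "hey", "*nods*", "hi there", "hello", "I am glad! You are talking to me"]
def welcome(user_response):
    """
    Return a random greeting if any word in user_response is a known greeting;
    otherwise return None.
    """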
    for word in user_response.split():
        if word.lower() in welcome_input:
            return random.choice(welcome_response)

### TF-IDF Vectorizer

**TF-IDF (Term Frequency-Inverse Document Frequency)** is a numerical statistic used in text mining and information retrieval to reflect how important a word is to a document in a collection or corpus. The **TF-IDF Vectorizer** transforms a text into a vector of TF-IDF scores, helping represent the text's "importance" in a way that can be used for machine learning tasks.

#### 1. What is TF-IDF?

TF-IDF combines two metrics:

- **Term Frequency (TF)**: This measures the frequency of a word in a document. A higher term frequency indicates the word is more representative of that document.
  - **Formula**:
    
    $
    \text{TF} = \frac{\text{Number of times a term appears in a document}}{\text{Total number of terms in the document}}
    $
- **Inverse Document Frequency (IDF)**: This measures the importance of a word across all documents in the corpus. Words that appear frequently in many documents (like "the", "is") receive a lower score, while words unique to specific documents receive a higher score.
  - **Formula**:
    
    $
    \text{IDF} = \log \left( \frac{\text{Total number of documents}}{\text{Number of documents containing the term}} \right)
    $

#### 2. Calculating TF-IDF

The TF-IDF score for a term in a document is calculated as:
$
\text{TF-IDF} = \text{TF} \times \text{IDF}
$
The result gives a higher weight to terms that are frequent in a specific document but not frequent across all documents in the corpus.

#### 3. Example of TF-IDF Calculation

Suppose we have three documents:

- Doc 1: "The cat in the hat."
- Doc 2: "The cat is very fast."
- Doc 3: "A fast cat is in the hat."

For each term, we would calculate TF for each document, IDF across all documents, and then multiply these values to get the TF-IDF score for each term in each document.
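The calculation can be done by hand for these three documents. This is a minimal sketch of the plain TF × IDF formulas given above, assuming a naive tokenizer (lowercase, drop the period, split on spaces); the helper names `tokenize`, `tf`, `idf` and `tf_idf` are illustrative and not part of any library.

```python
import math

docs = [
    "The cat in the hat.",
    "The cat is very fast.",
    "A fast cat is in the hat.",
]

def tokenize(text):
    # Naive tokenizer: lowercase, remove the period, split on whitespace
    return text.lower().replace(".", "").split()

corpus = [tokenize(d) for d in docs]

def tf(term, tokens):
    # Share of the document's tokens that are this term
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # Log of (number of documents / number of documents containing the term)
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

cat_score = tf_idf("cat", corpus[0], corpus)    # 'cat' is in every document, so IDF is log(1) = 0
very_score = tf_idf("very", corpus[1], corpus)  # 'very' is in one document only
print(cat_score)               # 0.0
print(round(very_score, 4))    # 0.2197
```

A term that occurs in every document gets a score of exactly zero under this formula, while a term unique to one document gets the largest IDF.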


#### 4. Using TF-IDF in Scikit-Learn

In [7]:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = [
    "The cat in the hat.",
    "The cat is very fast.",
    "A fast cat is in the hat."
]

# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Show the TF-IDF matrix
print(tfidf_matrix.toarray())

# Get feature names (terms)
print(vectorizer.get_feature_names_out())  # the vocabulary, in column order

[[0.34676577 0.         0.44652407 0.44652407 0.         0.69353155
  0.        ]
 [0.34957775 0.45014501 0.         0.         0.45014501 0.34957775
  0.59188659]
 [0.34035465 0.43826859 0.43826859 0.43826859 0.43826859 0.34035465
  0.        ]]
['cat' 'fast' 'hat' 'in' 'is' 'the' 'very']
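The numbers above do not follow the plain TF × IDF formula: `cat` occurs in every document yet has a non-zero weight. With its default settings, `TfidfVectorizer` uses raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and then scales each row to unit L2 length. The sketch below reproduces the first row in plain Python under those assumptions; the token lists are written out by hand and omit `a`, because the default tokenizer ignores one-character tokens.

```python
import math

# Hand-tokenized documents (single-character tokens such as 'a' are left out)
docs = [
    ["the", "cat", "in", "the", "hat"],
    ["the", "cat", "is", "very", "fast"],
    ["fast", "cat", "is", "in", "the", "hat"],
]
vocab = sorted({term for tokens in docs for term in tokens})
n_docs = len(docs)

def smoothed_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for tokens in docs if term in tokens)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(tokens):
    # Raw count times smoothed IDF, then L2-normalize the row
    raw = [tokens.count(term) * smoothed_idf(term) for term in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

row0 = tfidf_row(docs[0])
print(vocab)
print([round(v, 4) for v in row0])
```

The rounded values should agree with the first row printed by scikit-learn, about 0.3468 for `cat`, 0.4465 for `hat` and `in`, and 0.6935 for `the`.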


In [8]:
def generateResponse(user_response):
    robo_response = ''
    sent_tokens.append(user_response)  # append the inquiry to sent_tokens
    TfidfVec = TfidfVectorizer(tokenizer=Normalize, stop_words='english')
    tfidf = TfidfVec.fit_transform(sent_tokens)
    # Similarity of the inquiry (last row) with every sentence. TF-IDF rows are
    # L2-normalized by default, so the linear kernel equals cosine similarity.
    vals = linear_kernel(tfidf[-1], tfidf)
    idx = vals.argsort()[0][-2]  # index of the sentence most similar to the inquiry
    flat = vals.flatten()  # shape (1, 170) -> (170,)
    flat.sort()  # ascending order
    req_tfidf = flat[-2]  # the top match is the inquiry itself, so take the second
    if req_tfidf == 0 or "tell me about" in user_response:
        # No similar sentence was found, or the user asked about a topic explicitly
        print("Checking Wikipedia")
        if user_response:  # skip empty input
            robo_response = wikipedia_data(user_response)
        return robo_response
    else:
        robo_response = robo_response + sent_tokens[idx]  # the sentence related to the inquiry
        return robo_response

def wikipedia_data(query):
    # `query` replaces the old parameter name `input`, which shadowed the built-in
    reg_ex = re.search('tell me about (.*)', query)
    try:
        if reg_ex:  # the user wrote "tell me about <topic>"
            topic = reg_ex.group(1)  # extract the topic that the user asked about
        else:
            topic = query
        return wikipedia.summary(topic, sentences=3)
    except Exception:
        # Return the message so the caller does not print None
        return "No content has been found"

In [26]:
import wikipedia
wikipedia.summary('York', sentences = 3)

'York is a cathedral city in North Yorkshire, England, with Roman origins, sited at the confluence of the rivers Ouse and Foss. It is the county town of Yorkshire. The city has many historic buildings and other structures, such as a minster, castle, and city walls.'

In [9]:
len(sent_tokens)

169

In [15]:
inquiry = 'What is the equality act, 2010 of the United Kingdom?'

In [37]:
sent_tokens.append(inquiry)

In [10]:
TfidfVec = TfidfVectorizer(tokenizer=Normalize, stop_words='english')
tfidf = TfidfVec.fit_transform(sent_tokens)

In [11]:
tfidf.shape

(169, 1027)

In [12]:
tfidf[0]

<1x1027 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

In [39]:
vals = linear_kernel(tfidf[-1], tfidf) # find the similarity of every sentence with the inquiry 

In [45]:
vals  # array of 170 similarity scores, one per sentence plus the inquiry itself

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.07651511, 0.18818161, 0.        ,
        0.4672911 , 0.03159562, 0.        , 0.        , 0.        ,
        0.06151453, 0.05121481, 0.05494332, 0.        , 0.        ,
        0.07776611, 0.        , 0.07161565, 0.        , 0.        ,
        0.        , 0.05805943, 0.0813911 , 0.        , 0.        ,
        0.        , 0.0331519 , 0.        , 0.        , 0.        ,
        0.07070032, 0.05155054, 0.        , 0.        , 0.        ,
        0.03358599, 0.02507002, 0.        , 0.02691169, 0.        ,
        0.        , 0.        , 0.        , 0.03257163, 0.03147009,
        0.02684319, 0.        , 0.1040516 , 0.        , 0.        ,
        0.        , 0.0443987 , 0.06375502, 0.  

In [54]:
idx=vals.argsort()
idx

array([[  0, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
        119, 120, 121, 122, 123, 106, 124, 105, 103,  86,  87,  88,  89,
         90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102,
        104, 125, 126, 127, 150, 151, 152, 153, 154, 155, 156, 157, 158,
        159, 160, 161, 163, 164, 165, 166, 167, 149, 148, 147, 146, 128,
        129, 130, 131, 132, 133, 134, 135,  85, 136, 138, 139, 140, 141,
        142, 143, 144, 145, 137, 168,  84,  82,  19,  20,  21,  24,  27,
         28,  18,  29,  34,  36,  38,  39,  83,  43,  33,  44,  17,  15,
          1,   2,   3,   4,   5,   6,  16,   7,   9,  10,  11,  12,  13,
         14,   8,  45,  40,  62,  60,  57,  78,  47,  66,  61,  77,  80,
         59,  54,  53,  52,  69,  70,  73,  74,  49,  48,  68,  76,  56,
         65,  58,  64,  26,  63,  46,  55, 162,  79,  71,  31,  51,  75,
         32,  41,  81,  30,  72,  50,  37,  22,  35,  42,  67,  23,  25,
        169]])

In [47]:
vals.argsort()[0]

array([  0, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118,
       119, 120, 121, 122, 123, 106, 124, 105, 103,  86,  87,  88,  89,
        90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102,
       104, 125, 126, 127, 150, 151, 152, 153, 154, 155, 156, 157, 158,
       159, 160, 161, 163, 164, 165, 166, 167, 149, 148, 147, 146, 128,
       129, 130, 131, 132, 133, 134, 135,  85, 136, 138, 139, 140, 141,
       142, 143, 144, 145, 137, 168,  84,  82,  19,  20,  21,  24,  27,
        28,  18,  29,  34,  36,  38,  39,  83,  43,  33,  44,  17,  15,
         1,   2,   3,   4,   5,   6,  16,   7,   9,  10,  11,  12,  13,
        14,   8,  45,  40,  62,  60,  57,  78,  47,  66,  61,  77,  80,
        59,  54,  53,  52,  69,  70,  73,  74,  49,  48,  68,  76,  56,
        65,  58,  64,  26,  63,  46,  55, 162,  79,  71,  31,  51,  75,
        32,  41,  81,  30,  72,  50,  37,  22,  35,  42,  67,  23,  25,
       169])

In [49]:
idx=vals.argsort()[0][-2]
idx

25

In [52]:
flat = vals.flatten()
flat.shape

(170,)

In [55]:
flat.sort() # sort based on an ascending order 
flat

array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.     

In [None]:
req_tfidf = flat[-2]
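The `[-2]` indexing is easier to follow on a tiny array. This is a sketch with made-up similarity scores, where the last column plays the role of the inquiry compared with itself; it also shows an alternative that drops that column before taking the maximum.

```python
import numpy as np

# Hypothetical similarity row: three sentences, then the inquiry against itself (always the largest)
vals = np.array([[0.0, 0.47, 0.19, 1.0]])

# argsort returns indices in ascending order of score; [-1] is the inquiry, [-2] the best sentence
best_idx = vals.argsort()[0][-2]

# Sorting the flattened scores gives the matching value at the same position
best_score = np.sort(vals.flatten())[-2]

# Alternative: exclude the last column, then take the position of the maximum
alt_idx = vals[0, :-1].argmax()

print(best_idx, best_score, alt_idx)  # 1 0.47 1
```

The alternative avoids relying on the inquiry always sorting last, which can matter if a sentence in the corpus is identical to the inquiry.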

In [27]:
import re

# Named `query` rather than `input` so the built-in input() stays usable in the chat loop
query = "tell me about Python programming"
reg_ex = re.search('tell me about (.*)', query)
reg_ex.group(1) 

'Python programming'

In [30]:
flag = True
print("This is Shirin's first ChatBot! How can I help you today?")
while flag:
    # Ask for an input
    user_response = input()
    user_response = user_response.lower()
    if user_response not in ['bye', 'shutdown', 'exit', 'quit']:
        if user_response == 'thanks' or user_response == 'thank you':
            flag = False
            print("Chatterbot : You are welcome..")
        else:
            # Call welcome() once; it returns None unless a greeting was typed
            greeting = welcome(user_response)
            if greeting is not None:
                print("Chatterbot : " + greeting)
            else:
                print("Chatterbot : ", end="")
                print(generateResponse(user_response))
                # generateResponse() appended the inquiry to sent_tokens; remove it again
                sent_tokens.remove(user_response)
    else:
        flag = False
        print("Chatterbot : Bye!!! ")

This is Shirin's first ChatBot! How can I help you today?


 What is the equality act, 2010 of the United Kingdom?


Chatterbot : the equality act, 2010 of the united kingdom prohibits discrimination and mandates equal treatment in matters of employment as well as private and public services irrespective of race, age, sex, religion or disability .


 What are the main focus areas for use of analytics in the Asia-Pacific countries?


Chatterbot : compensation and benefits, talent acquisition, talent development and productivity are the established focus areas for use of analytics in the asia-pacific region.


 What article in the Indian Constitution talks about the right of Indian citizens?


Chatterbot : adherence to the rule of equality in public employment is a being feature of indian constitution and the rule of law is its core, the court cannot disable itself from making an order inconsistent with article 14 and 16 of the indian constitution.


 Tell me about York University 


Chatterbot : Checking Wikipedia
York University (French: Université York), also known as YorkU or simply YU,  is a public research university in Toronto, Ontario, Canada. It is Canada's third-largest university, and it has approximately 53,500 students, 7,000 faculty and staff, and over 375,000 alumni worldwide. It has 11 faculties, including the Lassonde School of Engineering, Schulich School of Business, Osgoode Hall Law School, Glendon College, and 32 research centres.


 Thank you


Chatterbot : You are welcome..


Suggested next steps: 
* Explore how this model performs when it is equipped with attention models and word embeddings.
* On a larger scale, build a text summarization tool that takes in any PDF file and produces a summary of it.

In [47]:
import unicodedata

# Sample list of words with non-ASCII characters
word_token = ['résumé', 'café', 'façade', 'niño', 'jalapeño']

# Initialize an empty list to store the cleaned words
new_words = []

# Loop through each word in the list `word_token`
for word in word_token:
    # Normalize, encode to ASCII, and decode back
    new_word = unicodedata.normalize('NFKD', word) \
        .encode('ascii', 'ignore')  # Remove non-ASCII characters
    
    # Decode back to UTF-8 and append to new_words list
    new_word = new_word.decode('utf-8', 'ignore')
    
    # Append cleaned word to the list
    new_words.append(new_word)

# Output the new list of words
print(new_words)
