# Customer claims analysis

## Open Dataset

In [29]:
import warnings
warnings.filterwarnings('ignore')  # silence all warnings; a second argument would be a message regex

In [1]:
import pandas as pd

In [2]:
data_url = 'https://raw.githubusercontent.com/rdemarqui/topic_modeling/main/datasets/reclame_aqui_full.csv'
df = pd.read_csv(data_url)
print(df.shape)
df.head()

(500, 5)


Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,Title,link,description
0,0,0,Claro residência,/claro/claro-residencia_h-VgVeL0ez_dO42I/,Estou há mais de 7 dias tentando efetuar a con...
1,1,1,NO CONSIGO FALAR COM UM HUMANO,/claro/no-consigo-falar-com-um-humano_0kiSV_xg...,Realizei o pagamento das faturas atrasadas e s...
2,2,2,Oi virou claro e não recebe ligações!!!,/claro/oi-virou-claro-e-nao-recebe-ligacoes_MG...,Minha mãe tinha um chip da Oi que virou Claro ...
3,3,3,Mudança de plano sem a minha autorização,/claro/mudanca-de-plano-sem-a-minha-autorizaca...,Na sexta-feira dia 6 de outubro fiz uma recarg...
4,4,4,Plano PREZAO,/claro/plano-prezao_iW3NARVJ71tk_jVM/,"Bom diaNa data de 08/10/2023, recebi um SMS in..."


## Text Preprocess

The dataset we will use was acquired through web scraping on reclameaqui.com.br. The data is messy, since users can write anything they want. Because of that, we need to clean and modify the text before we create our model.

### Normalize

First, we will normalize our text. This process goes through several steps: removing punctuation, accentuation, special characters (digits, isolated single letters and currency markers) and masked words.

In [3]:
import string

In [4]:
punctuations = list(string.punctuation)

def remove_punctuation(text):
  return ''.join([char if char not in punctuations else ' ' for char in text])

In [5]:
accentuation = {
    "á": "a", "ã": "a", "à": "a","â": "a",
    "é": "e","ê": "e",
    "í": "i",
    "ó": "o","õ": "o", "ô":"o",
    "ú": "u",
    "ç": "c"
    }

def remove_accentuation(text):
  return ''.join(accentuation.get(char, char) for char in text)

In [6]:
import re

def remove_special_characters(text):
  # Substring matching is needed here: a character-by-character test
  # can never match multi-character patterns such as " a " or "r$".
  text = text.replace("r$", " ").replace("$", " ")  # currency markers
  text = re.sub(r"\d", " ", text)                   # digits
  # isolated single letters, except w and y
  return re.sub(r"\b[a-vxz]\b", " ", text)

The Reclame Aqui website often masks sensitive information and offensive words, putting "editado pelo Reclame Aqui" ("edited by Reclame Aqui") in their place. We will remove this marker from our data.

In [7]:
def remove_mask(text):
  return text.replace("editado pelo reclame aqui", "")

In [8]:
def normalize_text(df, text_field, lower=True, rem_punct=True, rem_accent=True, rem_spec_caract=True, rem_mask=True):
  df[text_field + "_clean"] = df[text_field].astype(str)
  if lower: df[text_field + "_clean"] = df[text_field + "_clean"].str.lower()
  if rem_punct: df[text_field + "_clean"] = df[text_field + "_clean"].apply(remove_punctuation)
  if rem_accent: df[text_field + "_clean"] = df[text_field + "_clean"].apply(remove_accentuation)
  if rem_spec_caract: df[text_field + "_clean"] = df[text_field + "_clean"].apply(remove_special_characters)
  if rem_mask: df[text_field + "_clean"] = df[text_field + "_clean"].apply(remove_mask)
  df[text_field + "_clean"] = df[text_field + "_clean"].replace(r'\s+', ' ', regex=True) # collapse runs of whitespace into one space

In [9]:
# Normalizing text in a new column
normalize_text(df, 'description')
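
To make the effect of these steps visible on a single complaint, here is a minimal standalone sketch. It is not the notebook's implementation: `normalize_sample` and the sample sentence are hypothetical, punctuation is handled with `str.translate`, and accents are stripped with Unicode decomposition instead of the mapping dictionary above.

```python
import re
import string
import unicodedata

# Map every ASCII punctuation mark to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize_sample(text):
    """Hypothetical one-string version of the normalization steps."""
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    # NFD splits "ã" into "a" plus a combining tilde; dropping the
    # combining marks leaves only the base letters.
    text = "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if not unicodedata.combining(ch))
    text = re.sub(r"\d", " ", text)            # digits
    text = re.sub(r"\b[a-vxz]\b", " ", text)   # isolated single letters
    text = text.replace("editado pelo reclame aqui", " ")  # site mask
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

# Made-up sentence, for illustration only.
sample = "Paguei R$ 99,90 em 08/10/2023 e o [Editado pelo Reclame Aqui] não resolveu!!!"
print(normalize_sample(sample))  # → paguei em nao resolveu
```

The order matters: the mask can only be matched after lowercasing and after the surrounding brackets have been turned into spaces.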

### Stop Words

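Stop words (very frequent function words such as articles and prepositions) appear in almost every document and add little meaning, so they degrade topic quality. Removing them reduces noise and makes the text analysis algorithms more efficient.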

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [11]:
stops_nltk = nltk.corpus.stopwords.words('portuguese')
# Strip accents so the list matches the normalized text;
# a set removes duplicates and gives fast membership tests
stop_words = {remove_accentuation(word) for word in stops_nltk}

def remove_stop_words(text):
    words_list = text.split()
    words_list = [word for word in words_list if word not in stop_words]
    text = ' '.join(words_list)

    return text

In [12]:
# Removing Stop Words
df["description_clean_stop"] = df["description_clean"].apply(remove_stop_words)
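
As a quick illustration of what this step does to one sentence, the sketch below uses a tiny hand-written stop list. Both `sample_stop_words` and `drop_stop_words` are hypothetical names; the notebook itself relies on NLTK's Portuguese list.

```python
# Hypothetical mini stop list, for illustration only.
sample_stop_words = {"a", "o", "de", "que", "e", "nao", "em", "para"}

def drop_stop_words(text, stop_words):
    # Keep only the words that are not in the stop list.
    return " ".join(word for word in text.split() if word not in stop_words)

sentence = "paguei a fatura e o sinal de internet nao voltou"
print(drop_stop_words(sentence, sample_stop_words))
# → paguei fatura sinal internet voltou
```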

### Lemmatize

Lemmatization is a natural language processing technique that reduces words to their base (dictionary) form, so that inflected variants of the same word are counted as one term. It helps in standardizing words, improving text analysis, and simplifying language understanding.

In [13]:
import spacy.cli

language_models = ['pt_core_news_sm', 'pt_core_news_md', 'pt_core_news_lg']
spacy_model = language_models[2]

disable = ['tagger', 'parser', 'ner', 'entity_ruler', 'entity_linker', 'textcat',
           'morphologizer','attribute_ruler']

try:
  nlp = spacy.load(spacy_model, disable=disable)
except OSError:  # raised when the model is not installed yet
  spacy.cli.download(spacy_model)
  nlp = spacy.load(spacy_model, disable=disable)

print(nlp.pipe_names)

✔ Download and installation successful
You can now load the package via spacy.load('pt_core_news_lg')
['tok2vec', 'lemmatizer']


In [14]:
def lemma_pipe(doc_list):
  lemma_text_list = []
  for doc in nlp.pipe(doc_list, n_process=-1):
      lemma_text_list.append(" ".join(token.lemma_ for token in doc))

  return lemma_text_list

In [15]:
# Applying lemmatization
df["description_clean_stop_lemma"] = lemma_pipe(df["description_clean_stop"])
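
The idea behind lemmatization can be sketched with a toy lookup table. This is only a conceptual illustration under the assumption of a hand-written dictionary; `lemma_lookup` and `lemmatize_tokens` are hypothetical, and spaCy's lemmatizer covers the vocabulary far more thoroughly.

```python
# Hypothetical lookup table mapping inflected forms to their lemma.
lemma_lookup = {
    "paguei": "pagar",
    "faturas": "fatura",
    "cobrancas": "cobranca",
    "cancelaram": "cancelar",
}

def lemmatize_tokens(text, lookup):
    # Words missing from the table are returned unchanged.
    return " ".join(lookup.get(word, word) for word in text.split())

print(lemmatize_tokens("paguei faturas cancelaram internet", lemma_lookup))
# → pagar fatura cancelar internet
```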

### TF-IDF

Words with low TF-IDF scores occur in most documents, so they provide little discriminatory power and tend to dominate every topic. Removing them lets the model focus on more meaningful and distinctive terms, which improves topic quality and helps identify the keywords that truly characterize the underlying topics in the corpus.
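
To see why a word that appears everywhere scores low, here is a small pure-Python sketch of one common TF-IDF formulation: raw term frequency times a base-2 log of the inverse document frequency, followed by L2 normalization. This is assumed to mirror gensim's default weighting, but the three toy documents and the `tfidf_weights` helper are hypothetical.

```python
import math
from collections import Counter

# Toy tokenized corpus, for illustration only.
docs = [
    ["fatura", "cobranca", "indevida"],
    ["fatura", "internet", "lenta"],
    ["fatura", "cancelamento", "plano"],
]

n_docs = len(docs)
# Number of documents in which each token occurs.
doc_freq = Counter(token for doc in docs for token in set(doc))

def tfidf_weights(doc):
    counts = Counter(doc)
    # tf * log2(N / df): a token present in every document gets log2(1) = 0.
    raw = {t: tf * math.log2(n_docs / doc_freq[t]) for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

weights = tfidf_weights(docs[0])
print(weights)
```

Here "fatura" occurs in all three documents and receives a weight of 0, while "cobranca" and "indevida" share the remaining weight equally.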

In [16]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

import gensim
import gensim.corpora as corpora
from gensim.models import TfidfModel

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [48]:
documents = df["description_clean_stop_lemma"]

In [49]:
# Tokenize text
token_docs = [word_tokenize(text) for text in documents]

# Build the dictionary that maps each token to an integer id
dictionary = corpora.Dictionary(token_docs)

# Convert each document into a bag of words: (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in token_docs]

# Fit the TF-IDF model on the bag-of-words corpus
tfidf = TfidfModel(corpus, id2word=dictionary)

In [44]:
# Filtering low TF-IDF values
# Note: TF-IDF weights are never negative, so a threshold of 0 removes
# nothing; raise low_value to actually filter tokens.
low_value = 0
low_value_words = []
for bow in corpus:
    low_value_words += [token_id for token_id, value in tfidf[bow] if value < low_value]

# Removing low TF-IDF tokens from the dictionary
dictionary.filter_tokens(bad_ids=low_value_words)

# Recompute the corpus with the low-value words filtered out
corpus = [dictionary.doc2bow(doc) for doc in token_docs]

In [45]:
# Count the documents left empty after filtering
qtd = sum(1 for item in corpus if len(item) == 0)

print(qtd)

0


## Latent Dirichlet Allocation (LDA)

In [50]:
lda_model = gensim.models.LdaModel(corpus,
                                   num_topics=10,
                                   id2word=dictionary,
                                   random_state=42,
                                   update_every=1,
                                   chunksize=50,
                                   passes=15,
                                   alpha="auto")

## Visualization

In [21]:
try:
  import pyLDAvis.gensim
except ImportError:  # install pyLDAvis only when it is missing
  !pip install pyLDAvis==2.1.2 -q
  import pyLDAvis.gensim


In [51]:
vis = pyLDAvis.gensim.prepare(lda_model, corpus, lda_model.id2word, mds="mmds", R=20)
pyLDAvis.display(vis)