## Microsoft Employee Reviews
### Objective:
Build a system that analyzes Microsoft employee reviews, preprocesses the text, and automatically classifies each review as positive or negative based on its textual content.

### Data Collection and Organization

In [98]:
import pandas as pd

df = pd.read_csv("employee_reviews.csv")

# Keep only the Microsoft reviews
df = df[df['company'].str.lower() == 'microsoft'].reset_index(drop=True)

df.columns


Index(['Unnamed: 0', 'company', 'location', 'dates', 'job-title', 'summary',
       'pros', 'cons', 'advice-to-mgmt', 'overall-ratings',
       'work-balance-stars', 'culture-values-stars',
       'carrer-opportunities-stars', 'comp-benefit-stars',
       'senior-mangemnet-stars', 'helpful-count', 'link'],
      dtype='object')
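
The column listing above contains two misspelled names (`carrer-opportunities-stars` and `senior-mangemnet-stars`), and the table printed later in the notebook shows missing ratings stored as the string `none`. A minimal sketch of how both could be tidied follows; `COLUMN_FIXES`, `fix_column`, and `parse_rating` are hypothetical names that do not appear in the original notebook, and the rest of the notebook works without this step.

```python
# Hypothetical helpers for illustration; not part of the original notebook.
COLUMN_FIXES = {
    "carrer-opportunities-stars": "career-opportunities-stars",
    "senior-mangemnet-stars": "senior-management-stars",
}

def fix_column(name):
    """Return the corrected column name, or the name unchanged."""
    return COLUMN_FIXES.get(name, name)

def parse_rating(value):
    """Convert a star rating to float; the placeholder 'none' becomes None."""
    if value is None:
        return None
    text = str(value).strip().lower()
    if text in ("", "none"):
        return None
    return float(text)

columns = ["overall-ratings", "carrer-opportunities-stars", "senior-mangemnet-stars"]
print([fix_column(c) for c in columns])
print([parse_rating(v) for v in ["5.0", "none", 2.5]])
```

With pandas, the same mapping can be passed to `df.rename(columns=COLUMN_FIXES)`, and `pd.to_numeric(df[col], errors='coerce')` turns the `none` placeholders into `NaN`.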

### Text Preprocessing

In [99]:
import nltk

nltk.data.path.clear()
nltk.data.path.append('./nltk_data')


In [100]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rrs4_cesar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rrs4_cesar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rrs4_cesar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [101]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Point NLTK at the nltk_data folder on your machine (adjust the path if needed)
nltk.data.path.append(r'C:/Users/rrs4_cesar/Desktop/Microsoft_Review/nltk_data')

# Download the required resources if they are not there yet
nltk.download('stopwords', download_dir=r'C:/Users/rrs4_cesar/Desktop/Microsoft_Review/nltk_data')
nltk.download('wordnet', download_dir=r'C:/Users/rrs4_cesar/Desktop/Microsoft_Review/nltk_data')
nltk.download('omw-1.4', download_dir=r'C:/Users/rrs4_cesar/Desktop/Microsoft_Review/nltk_data')

# Initialize the components once, outside the function
tokenizer = RegexpTokenizer(r'\w+')  # keeps runs of word characters, drops punctuation
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()
    # Strip punctuation first, so "work-life" becomes "worklife" rather than two tokens
    text = re.sub(r'[^\w\s]', '', text)
    tokens = tokenizer.tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)

# Simple test
print(preprocess("Hello, how are you doing? Running and working!"))


hello running working


[nltk_data] Downloading package stopwords to
[nltk_data]     C:/Users/rrs4_cesar/Desktop/Microsoft_Review/nltk_data
[nltk_data]     ...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:/Users/rrs4_cesar/Desktop/Microsoft_Review/nltk_data
[nltk_data]     ...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:/Users/rrs4_cesar/Desktop/Microsoft_Review/nltk_data
[nltk_data]     ...
[nltk_data]   Package omw-1.4 is already up-to-date!
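
The order of the two cleaning steps in `preprocess` affects the tokens the model sees. The sketch below uses only the standard library to compare stripping punctuation before tokenizing with tokenizing alone. `strip_punctuation` and `word_tokens` are hypothetical helpers, and `re.findall` is used here as a stand-in that approximates `RegexpTokenizer(r'\w+')`.

```python
import re

def strip_punctuation(text):
    """Lowercase the text and delete everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

def word_tokens(text):
    """Return runs of word characters (stand-in for the notebook's tokenizer)."""
    return re.findall(r"\w+", text)

sample = "Great work-life balance, can't complain!"

# Punctuation removed first: hyphenated and contracted words are fused.
fused = word_tokens(strip_punctuation(sample))
# Tokenizer only: the same words are split at the punctuation.
split = word_tokens(sample.lower())

print(fused)  # ['great', 'worklife', 'balance', 'cant', 'complain']
print(split)  # ['great', 'work', 'life', 'balance', 'can', 't', 'complain']
```

Either convention can work, as long as the same one is applied when training the vectorizer and when classifying new text.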


In [102]:
# 1. Create the 'text' column from the pros and cons
df['text'] = df['pros'].fillna('') + ' ' + df['cons'].fillna('')

# 2. Check that 'text' was created correctly
print(df[['pros', 'cons', 'text']].head())

# 3. Apply preprocess
df['clean_text'] = df['text'].apply(preprocess)

# 4. Inspect the result
print(df[['text', 'clean_text']].head())


                                                pros  \
0  Culture, role impact, mission driven, collabor...   
1  1. If you love tech, this is a great place. No...   
2                     Great company and Great people   
3  Benefits, work-life balance, tons of internal ...   
4  Smart people, work life balance, growth mindse...   

                                                cons  \
0          Volume of work is sometimes unmanageable,   
1  Brand on Your Resume: After many years of losi...   
2                         I see no cons at this time   
3                       Can't think of any right now   
4                 Can be hard to transfer internally   

                                                text  
0  Culture, role impact, mission driven, collabor...  
1  1. If you love tech, this is a great place. No...  
2  Great company and Great people I see no cons a...  
3  Benefits, work-life balance, tons of internal ...  
4  Smart people, work life balance, growth mindse..

In [103]:
df['clean_text'] = df['text'].apply(preprocess)
df


Unnamed: 0.1,Unnamed: 0,company,location,dates,job-title,summary,pros,cons,advice-to-mgmt,overall-ratings,work-balance-stars,culture-values-stars,carrer-opportunities-stars,comp-benefit-stars,senior-mangemnet-stars,helpful-count,link,text,clean_text
0,49600,microsoft,none,"Dec 11, 2018",Current Employee - Anonymous Employee,Microsoft,"Culture, role impact, mission driven, collabor...","Volume of work is sometimes unmanageable,",none,5.0,4.0,5.0,5.0,5.0,5.0,0,https://www.glassdoor.com/Reviews/Microsoft-Re...,"Culture, role impact, mission driven, collabor...",culture role impact mission driven collaborati...
1,49601,microsoft,"Redmond, WA","Jan 28, 2013",Current Employee - Anonymous Employee,Thoughts after 10 years....,"1. If you love tech, this is a great place. No...",Brand on Your Resume: After many years of losi...,I'll type it here - but I don't they are liste...,4.0,4.0,2.0,2.0,4.0,none,1439,https://www.glassdoor.com/Reviews/Microsoft-Re...,"1. If you love tech, this is a great place. No...",1 love tech great place doubt youll talk tech ...
2,49602,microsoft,"Redmond, WA","Dec 9, 2018",Current Employee - Anonymous Employee,Technical Account Manager,Great company and Great people,I see no cons at this time,Keep up the great work,5.0,4.0,5.0,5.0,5.0,5.0,1,https://www.glassdoor.com/Reviews/Microsoft-Re...,Great company and Great people I see no cons a...,great company great people see con time
3,49603,microsoft,"Chicago, IL","Dec 9, 2018",Current Employee - CSA,Great company,"Benefits, work-life balance, tons of internal ...",Can't think of any right now,none,5.0,5.0,5.0,5.0,5.0,5.0,0,https://www.glassdoor.com/Reviews/Microsoft-Re...,"Benefits, work-life balance, tons of internal ...",benefit worklife balance ton internal knowledg...
4,49604,microsoft,none,"Dec 9, 2018",Current Employee - Anonymous Employee,Great Company to work for,"Smart people, work life balance, growth mindse...",Can be hard to transfer internally,none,5.0,5.0,5.0,4.0,5.0,5.0,0,https://www.glassdoor.com/Reviews/Microsoft-Re...,"Smart people, work life balance, growth mindse...",smart people work life balance growth mindset ...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17925,67525,microsoft,none,"Dec 16, 2010",Former Employee - Anonymous Employee,Enriching experience for a beginner but bad fo...,"-Access to a wide range of technologies, compl...",-Testers(SDET's ) do not get as many opportuni...,Make the company leaner and Meaner. (which wou...,3.0,3.0,none,4.0,4.0,2.0,0,https://www.glassdoor.com/Reviews/Microsoft-Re...,"-Access to a wide range of technologies, compl...",access wide range technology complete exposure...
17926,67526,microsoft,none,"Dec 16, 2010",Current Employee - Senior Marketing Manager,A complex and interesting experience,- Once you're at Microsoft you can change role...,- Be prepared to be flexible - frequent change...,none,3.0,1.5,none,2.5,4.0,2.5,0,https://www.glassdoor.com/Reviews/Microsoft-Re...,- Once you're at Microsoft you can change role...,youre microsoft change role either choice freq...
17927,67527,microsoft,none,"Dec 15, 2010",Current Employee - Account Manager,Good Place to Work,Nice place to work. Good atmosphere with advan...,Management confusion at times with vision for ...,none,4.0,3.0,none,4.0,4.5,3.5,0,https://www.glassdoor.com/Reviews/Microsoft-Re...,Nice place to work. Good atmosphere with advan...,nice place work good atmosphere advancement ma...
17928,67528,microsoft,none,"Dec 15, 2010",Current Employee - Senior Test Lead,"It's a competitive work place, with overload w...","Smart people around you, can learn from them","Politics, weak moral, leaning loyalty",none,3.0,2.0,none,3.0,3.5,3.0,0,https://www.glassdoor.com/Reviews/Microsoft-Re...,"Smart people around you, can learn from them P...",smart people around learn politics weak moral ...


### Creating the Labels (Positive/Negative)

In [104]:
# Binary label: 1 = positive (rating >= 4), 0 = negative (rating <= 2); 3-star reviews are dropped
# .copy() makes the filtered frame independent, which avoids pandas' SettingWithCopyWarning
df = df[df['overall-ratings'].isin([1.0, 2.0, 4.0, 5.0])].copy()
df['label'] = df['overall-ratings'].apply(lambda x: 1 if x >= 4 else 0)
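
The labeling rule can be sketched on its own, without pandas. `rating_to_label` is a hypothetical helper and the ratings list is made up for illustration. Unlike the `isin` filter in the cell above, this version would also keep any half-star values, which is an assumption about how such values should be treated.

```python
from collections import Counter

def rating_to_label(rating):
    """Map a star rating to 1 (positive), 0 (negative), or None (neutral, dropped)."""
    if rating >= 4:
        return 1
    if rating <= 2:
        return 0
    return None

ratings = [5.0, 4.0, 3.0, 2.0, 1.0, 5.0]  # made-up sample, not taken from the dataset
labels = [rating_to_label(r) for r in ratings]
kept = [label for label in labels if label is not None]

print(labels)
print(Counter(kept))
```

Counting the kept labels this way is a quick check on class balance before training.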


### Feature Extraction

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=3000)
X = vectorizer.fit_transform(df['clean_text'])
y = df['label']
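
To show what the vectorizer computes, here is a simplified pure-Python sketch of TF-IDF. It uses the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default. `tfidf_rows` is a hypothetical function. It splits on whitespace and has no `max_features` cap, so it will not reproduce the notebook's matrix exactly.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed IDF weights and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty documents
        rows.append([v / norm for v in raw])
    return vocab, rows

docs = ["great team great culture", "long hours great pay"]  # made-up reviews
vocab, rows = tfidf_rows(docs)
for row in rows:
    print({t: round(w, 3) for t, w in zip(vocab, row) if w > 0})
```

A term that appears in every document, such as `great` here, gets the minimum IDF of 1, so rarer terms carry relatively more weight per occurrence.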


### Modeling and Classification

In [106]:
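from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# The notebook does not show the cell that creates the split. The parameters
# below (80/20, stratified, fixed seed) are an assumption, so the report that
# follows may not be reproduced exactly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Create and train the model; class_weight='balanced' compensates for the class imbalance
model = RandomForestClassifier(n_estimators=200, max_depth=10, class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))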


              precision    recall  f1-score   support

           0       0.42      0.60      0.49       389
           1       0.93      0.87      0.90      2434

    accuracy                           0.83      2823
   macro avg       0.67      0.73      0.69      2823
weighted avg       0.86      0.83      0.84      2823
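
The two average rows in the report can be recomputed from the per-class rows, which shows why they differ so much. The sketch below copies the rounded per-class figures from the report above, so the results agree with the printed averages only up to rounding. `macro` and `weighted` are hypothetical helper names.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
report = {
    0: (0.42, 0.60, 0.49, 389),
    1: (0.93, 0.87, 0.90, 2434),
}

total = sum(row[3] for row in report.values())

def macro(i):
    """Unweighted mean over classes: each class counts equally."""
    return sum(row[i] for row in report.values()) / len(report)

def weighted(i):
    """Mean over classes weighted by support: the larger class dominates."""
    return sum(row[i] * row[3] for row in report.values()) / total

for name, i in (("precision", 0), ("recall", 1), ("f1", 2)):
    print(f"{name}: macro={macro(i):.3f} weighted={weighted(i):.3f}")
```

Because class 1 accounts for 2434 of the 2823 test reviews, the weighted averages stay close to its scores, while the macro averages show the weaker performance on negative reviews.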



### Simulating New Reviews

In [107]:
def predict_review(text):
    clean = preprocess(text)
    print(f"Clean text: {clean}")
    vec = vectorizer.transform([clean])
    print(f"Vectorized: {vec.toarray()}")
    prediction = model.predict(vec)
    print(f"Prediction: {prediction}")
    return "Positive" if prediction[0] == 1 else "Negative"

# Example usage
print(predict_review("The team is very collaborative, and the work-life balance is great."))


Clean text: team collaborative worklife balance great
Vectorized: [[0. 0. 0. ... 0. 0. 0.]]
Prediction: [1]
Positive
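
The truncated array printed above does not show which words contributed to the prediction. One way to check is to compare the cleaned tokens against the vectorizer's vocabulary. In the notebook that mapping is available as `vectorizer.vocabulary_`. The sketch below uses a small made-up vocabulary instead, and `split_by_vocabulary` is a hypothetical helper.

```python
def split_by_vocabulary(clean_text, vocabulary):
    """Split tokens into those the vectorizer knows and those it silently ignores."""
    tokens = clean_text.split()
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown

# Made-up vocabulary for illustration; in the notebook, pass vectorizer.vocabulary_.
toy_vocabulary = {"team": 0, "balance": 1, "great": 2}

known, unknown = split_by_vocabulary(
    "team collaborative worklife balance great", toy_vocabulary
)
print("known:", known)
print("ignored:", unknown)
```

If every token of a review falls outside the vocabulary, the input vector is all zeros and the prediction carries no information about the text.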


In [108]:
# Saving the model and the vectorizer

import joblib

# Assumes model and vectorizer (TF-IDF) are already trained
joblib.dump(model, "modelo_sentimento.pkl")
joblib.dump(vectorizer, "vetorizador_tfidf.pkl")


['vetorizador_tfidf.pkl']

### Visualization and Interface

In [109]:
import re

import gradio as gr
import joblib
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import RegexpTokenizer

# Make sure the training notebook has already run these:
# joblib.dump(model, "modelo_sentimento.pkl")
# joblib.dump(vectorizer, "vetorizador_tfidf.pkl")

# Load the trained model and the TF-IDF vectorizer
model = joblib.load("modelo_sentimento.pkl")
tfidf = joblib.load("vetorizador_tfidf.pkl")

# Download the required resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Preprocessing must match the function used at training time; otherwise
# tokens such as "worklife" never line up with the vectorizer's vocabulary
def preprocess(text):
    text = re.sub(r'[^\w\s]', '', text.lower())
    tokens = tokenizer.tokenize(text)
    tokens = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return ' '.join(tokens)

# Function called by the interface
def classificar_review(review_text):
    texto_limpo = preprocess(review_text)
    vetor = tfidf.transform([texto_limpo])
    pred = model.predict(vetor)[0]
    prob = model.predict_proba(vetor).max()
    return f"Sentiment: {'Positive' if pred == 1 else 'Negative'} (confidence: {prob:.2f})"

# Gradio interface
demo = gr.Interface(
    fn=classificar_review,
    inputs=gr.Textbox(lines=5, label="Enter a Microsoft review"),
    outputs=gr.Textbox(label="Classification Result"),
    title="Review Classifier - Microsoft",
    description="This model automatically classifies a review as positive or negative using NLP."
)

# Run the app
if __name__ == "__main__":
    demo.launch()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rrs4_cesar\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\rrs4_cesar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\rrs4_cesar\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


* Running on local URL:  http://127.0.0.1:7864
* To create a public link, set `share=True` in `launch()`.


Traceback (most recent call last):
  File "c:\Users\rrs4_cesar\Desktop\Microsoft_Review\venv\Lib\site-packages\gradio\queueing.py", line 625, in process_events
    response = await route_utils.call_process_api(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<5 lines>...
    )
    ^
  File "c:\Users\rrs4_cesar\Desktop\Microsoft_Review\venv\Lib\site-packages\gradio\route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<11 lines>...
    )
    ^
  File "c:\Users\rrs4_cesar\Desktop\Microsoft_Review\venv\Lib\site-packages\gradio\blocks.py", line 2191, in process_api
    result = await self.call_function(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<8 lines>...
    )
    ^
  File "c:\Users\rrs4_cesar\Desktop\Microsoft_Review\venv\Lib\site-packages\gradio\blocks.py", line 1702, in call_function
    prediction = await anyio.to_thread.run_sync(  # type: ignore
                 ^^^^^^^^^