# 1. Load Data

This step loads the text data from the `artigos.txt` file. The file is read as UTF-8 from `data/raw/` under the project base path, and the first 500 characters are printed to verify that the content loaded correctly.


In [None]:
import os

# Defining the base path of the project
BASE_PATH = r'repository_path'

# Defining the full path to the 'artigos.txt' file
FILE_PATH = os.path.join(BASE_PATH, "data", "raw", "artigos.txt")

# Reading the data from the text file
with open(FILE_PATH, 'r', encoding='utf-8') as f:
    text_data = f.read()

# Checking the first 500 characters of the text
print(text_data[:500])





imagem 

Temos a seguinte classe que representa um usuário no nosso sistema:

java

Para salvar um novo usuário, várias validações são feitas, como por exemplo: Ver se o nome só contém letras, [**o CPF só números**] e ver se o usuário possui no mínimo 18 anos. Veja o método que faz essa validação:

java 

Suponha agora que eu tenha outra classe, a classe `Produto`, que contém um atributo nome e eu quero fazer a mesma validação que fiz para o nome do usuário: Ver se só contém letras. E aí? Vou
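
The loading step can also be wrapped in a small helper that fails with a clear message when the path is wrong. This is only a sketch: `load_text` and its `preview_chars` parameter are hypothetical names, not part of the notebook.

```python
from pathlib import Path


def load_text(path, preview_chars=500):
    """Read a UTF-8 text file and return (full_text, preview)."""
    file_path = Path(path)
    if not file_path.is_file():
        # Fail early with the resolved path so a bad BASE_PATH is easy to spot
        raise FileNotFoundError(f"Text file not found: {file_path.resolve()}")
    text = file_path.read_text(encoding="utf-8")
    return text, text[:preview_chars]
```

Calling `text_data, preview = load_text(FILE_PATH)` followed by `print(preview)` would reproduce the behavior of the cell above.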


# 2. Text Preprocessing and Tokenization

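In this step, the text data is preprocessed. A function is defined to clean the text by lowercasing it and removing non-alphabetic characters, tokenizing it, removing stopwords (common words that don't add much meaning), and lemmatizing the tokens (reducing them to their base form). Note that NLTK's `WordNetLemmatizer` is built on the English WordNet, so it leaves most Portuguese tokens unchanged; a Portuguese-specific stemmer or lemmatizer would be needed for this last step to have much effect.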


In [None]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the resources needed for tokenization, stopwords, and lemmatization
# (recent NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Build these once instead of on every call to clean_and_tokenize
STOP_WORDS = set(stopwords.words('portuguese'))
LEMMATIZER = WordNetLemmatizer()  # English WordNet: most Portuguese tokens pass through unchanged

# Define a function for text cleaning and tokenization
def clean_and_tokenize(text):
    # Lowercase, then replace every run of non-letter characters with a space
    text = re.sub(r'[^a-zA-Záàâãéèêíóôõúç]+', ' ', text.lower())

    # Tokenization (using the Portuguese Punkt model)
    tokens = word_tokenize(text, language='portuguese')

    # Remove Portuguese stopwords
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # Lemmatization
    return [LEMMATIZER.lemmatize(token) for token in tokens]

# Split the text into lines and apply the cleaning function
processed_texts = [clean_and_tokenize(line) for line in text_data.split("\n")]

# Check the first 5 processed samples
processed_texts[:5]
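
To see what the cleaning regex and the stopword filter do without downloading any NLTK resources, the sketch below applies the same character class to one sample sentence. It rests on two assumptions: `SAMPLE_STOPWORDS` is a tiny hypothetical stand-in for `stopwords.words('portuguese')`, and whitespace splitting stands in for `word_tokenize`. The helper name `simple_clean` is also hypothetical.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's Portuguese list
SAMPLE_STOPWORDS = {"o", "a", "de", "que", "e", "se"}

# Character class from clean_and_tokenize: anything outside it becomes a space
NON_LETTERS = re.compile(r"[^a-zA-Záàâãéèêíóôõúç]+")


def simple_clean(text):
    cleaned = NON_LETTERS.sub(" ", text.lower())
    return [tok for tok in cleaned.split() if tok not in SAMPLE_STOPWORDS]


print(simple_clean("Ver se o CPF só contém números (ex.: 123)!"))
# ['ver', 'cpf', 'só', 'contém', 'números', 'ex']
```

Digits and punctuation disappear, accented letters survive, and the two stopwords in the sample (`se`, `o`) are dropped.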


# 3. Feature Extraction Using TF-IDF

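This step uses the `TfidfVectorizer` from scikit-learn to transform the tokenized text into numerical features. TF-IDF stands for Term Frequency-Inverse Document Frequency: each word is weighted by how often it occurs in a document, discounted by how many documents in the corpus contain it. The vectorizer is created with `stop_words='english'`, which filters common English words; since the data is Portuguese and Portuguese stopwords were already removed in step 2, this option is not needed here and can be dropped. The labels `y` are random placeholders and must be replaced with real labels before the results of the models below mean anything.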
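
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand in plain Python. It assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document's weights. Tokenization is simplified to whitespace splitting, and `tfidf_rows` is a hypothetical helper, not a scikit-learn function.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, with L2-normalized weights."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Documents with no tokens (e.g. blank lines) get an empty row
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


rows = tfidf_rows(["usuário nome cpf", "produto nome", "usuário usuário idade"])
print(rows[0])
```

In the first document, `cpf` receives the largest weight because it appears in only one document, while `nome` and `usuário` each appear in two.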


In [50]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert tokenized texts back to strings
texts_as_strings = [' '.join(tokens) for tokens in processed_texts]

# Extracting features using TfidfVectorizer
# NOTE: stop_words='english' filters English stop words only; Portuguese stop words
# were already removed in clean_and_tokenize, so stop_words=None would also work
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts_as_strings)

# Placeholder labels: random 0/1 values unrelated to the text (use actual labels in practice)
y = np.random.randint(2, size=X.shape[0])

# Check the shape of the features and labels
print("Feature matrix shape:", X.shape)
print("Labels shape:", y.shape)


Feature matrix shape: (39827, 18840)
Labels shape: (39827,)


# 4. Split Data into Training and Test Sets

This step splits the dataset into training and test sets: 80% for training and 20% for testing, with `random_state=42` so that the split is reproducible. Holding out a test set makes it possible to evaluate the model on data it has not seen.


In [51]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shape of the training and test sets
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)


Training features shape: (31861, 18840)
Testing features shape: (7966, 18840)
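
The shapes printed above follow from simple arithmetic on the 39827 rows. The sketch below assumes that, for a fractional `test_size`, the test set size is rounded up and the training set takes the remainder; this assumption reproduces the numbers in the output. `split_sizes` is a hypothetical helper.

```python
import math


def split_sizes(n_samples, test_size):
    # Assumption: the test size is rounded up and the training set gets the rest
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


print(split_sizes(39827, 0.2))  # (31861, 7966)
```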


# 5. Train Model Using Logistic Regression

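Here, a logistic regression model is trained on the training data (`X_train`, `y_train`) and then used to predict labels for the test set. Accuracy is the fraction of test labels predicted correctly. Because `y` currently holds random placeholder labels, the features carry no information about them, so an accuracy close to 0.5 (chance level for two classes) is the expected result.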


In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train the Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Logistic Regression Model Accuracy: {accuracy}')


Logistic Regression Model Accuracy: 0.5020085362791865
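
An accuracy of about 0.502 is what guessing would produce on random binary labels. The sketch below, which uses only the standard library, simulates predictions that are independent of the labels and computes the standard error of the accuracy for a test set of 7966 samples. `chance_accuracy` is a hypothetical helper and the seed is arbitrary.

```python
import math
import random


def chance_accuracy(n_samples, seed=0):
    rng = random.Random(seed)
    y_true = [rng.randint(0, 1) for _ in range(n_samples)]
    # Predictions that are independent of the labels, like a model fit to noise
    y_pred = [rng.randint(0, 1) for _ in range(n_samples)]
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / n_samples


n_test = 7966  # size of the test set reported in step 4
std_error = math.sqrt(0.5 * 0.5 / n_test)
print(f"simulated accuracy: {chance_accuracy(n_test):.4f}")
print(f"standard error around 0.5: {std_error:.4f}")
```

The standard error is roughly 0.0056, so the reported accuracy lies well within one standard error of 0.5.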


# 6. Train Model Using Random Forest

In this step, a Random Forest classifier is trained. Random Forest is an ensemble learning method that fits many decision trees (100 here) on bootstrap samples of the training data and combines their outputs into a final prediction. The accuracy of the model is then evaluated on the test set.
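
How the trees are combined can be sketched in a few lines. scikit-learn's `RandomForestClassifier` averages the class-probability estimates of its trees rather than counting one hard vote per tree; the example below mimics that with made-up probabilities. `forest_predict` and the numbers are hypothetical.

```python
def forest_predict(tree_probas):
    """Average per-tree class probabilities; return (winning class, mean probabilities)."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    mean = [sum(p[c] for p in tree_probas) / n_trees for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__), mean


# Hypothetical probability estimates from three trees for a single sample
label, mean = forest_predict([[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]])
print(label, mean)
```

Two of the three trees lean towards class 1, but the one confident tree pulls the averaged probability towards class 0, so averaging and majority voting can disagree.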


In [53]:
from sklearn.ensemble import RandomForestClassifier

# Train the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Make predictions on the test set
rf_y_pred = rf_model.predict(X_test)

# Evaluate the model accuracy
rf_accuracy = accuracy_score(y_test, rf_y_pred)
print(f'Random Forest Model Accuracy: {rf_accuracy}')


Random Forest Model Accuracy: 0.503138337936229


# 7. Hyperparameter Tuning for Random Forest

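Here, we perform hyperparameter tuning for the Random Forest model using `GridSearchCV`. It evaluates every combination of the `n_estimators` and `max_depth` values in the grid with 5-fold cross-validation and keeps the combination with the best mean score. The best model is then re-evaluated on the test set. As long as the labels are random placeholders, no hyperparameter setting can lift the scores above chance level.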


In [54]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [10, 20, None],
}

# Initialize GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1)

# Fit the model to the training data
grid_search.fit(X_train, y_train)

# Display the best parameters and the corresponding accuracy
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-validation Score:", grid_search.best_score_)

# Evaluate the best model on the test data
best_rf_model = grid_search.best_estimator_
rf_test_accuracy = best_rf_model.score(X_test, y_test)
print(f'Best Random Forest Test Accuracy: {rf_test_accuracy}')


Best Parameters: {'max_depth': 20, 'n_estimators': 100}
Best Cross-validation Score: 0.5026207845421318
Best Random Forest Test Accuracy: 0.5046447401456189
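
The cost of the search above can be estimated by enumerating the grid. The sketch below uses only the standard library and assumes the default behavior of refitting the best candidate once on the full training set after cross-validation.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [10, 20, None],
}
cv_folds = 5

# Every combination of the listed values is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fit once per fold; the best one is then refit on all training data
total_fits = len(candidates) * cv_folds + 1
print(len(candidates), "candidates,", total_fits, "fits including the final refit")
```

With 9 candidates and 5 folds, the search trains 45 forests during cross-validation plus one final refit, which is why `n_jobs=-1` is useful here.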
