GitHub - tanyelai/genetics-of-texts: Genetic algorithm for text classification | Metin sınıflandırma için genetik algoritma

What happens if we try to use genetic algoritm to classify text?

Homework Topic: Text Classification with Genetic Algorithm / Hill Climbing

Find a text classification dataset with at least 100 examples and 2 classes (positive/negative, sports/economy, etc.). Individuals will consist of a list of N words. The first part will represent the first class and the second part will represent the second class. For example, an individual can be represented as ["beautiful", "amazing", ..., "awesome"]["terrible", "bad", ..., "disliked"]. This individual will classify 100 examples according to whether they contain these words and achieve a classification success rate. This success value will be used as the fitness function. The number of words belonging to class 1 and class 2 in the text can be found, and the operation can be performed according to these values. In case of a tie, the class can be randomly determined. Initially, words will be randomly assigned to all individuals. Throughout the process, operations such as crossover and random word replacement will be applied to try to increase the fitness value of the individuals. The word lists of individuals will be optimized using genetic algorithm or hill climbing.

General Evaluation

I believe that using "F1 score" for fitness has a positive impact on the overall process when evaluating the results. Although I haven't conducted a comprehensive experiment to share statistical information, my observation is in this direction.

In almost all runs, when we increase the mutation, the results improve. However, the same is not true for crossover, sometimes it's effective, sometimes it's not. I kept the number of generations at 200 because we can already see the generations underneath in the graph.

The population size is also one of the factors that significantly affect the results, generally, when we increase the population, the results also improve. I didn't use a "very" large population because it turned my process into incredibly long periods and wasn't appropriate for comparison. Using a very large population also negatively affected the performance in terms of time and results.

Additionally, I used an approach that we can consider a bit of a cheat to improve the results. When I did some research, I learned about a method called "elitism." This method guarantees the transfer of the best individual to the next generation. Therefore, we don't lose points in the score. When we work completely randomly, the results naturally have more fluctuations and lower scores.

Example Results

Elitizm and F1 fitness are used for Data1 (food)

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
data		data
results		results
vocabs		vocabs
LICENSE		LICENSE
README.md		README.md
README.txt		README.txt
ai_hw1.pdf		ai_hw1.pdf
main.ipynb		main.ipynb
prepare_data.ipynb		prepare_data.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

What happens if we try to use genetic algoritm to classify text?

General Evaluation

Example Results

About

Releases

Packages

Languages

License

tanyelai/genetics-of-texts

Folders and files

Latest commit

History

Repository files navigation

What happens if we try to use genetic algoritm to classify text?

General Evaluation

Example Results

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages