Application: classify core developers irc output #3744

Open
karlnapf opened this issue Mar 23, 2017 · 18 comments

Comments

@karlnapf
Member

This should be a fun entrance task that is useful for any kind of applied project with Shogun. The task is simple: train a model that can identify core developers of our project based on their IRC behaviour (from the logs).

Steps

  • Download logs, structure them nicely
  • Extract features and labels (sentences, and who wrote them)
  • Pick a nice feature representation
  • Optional: Implement a new feature converter for string features in Shogun
  • Train a multiclass model
  • Tune parameters
  • Write a nice notebook
@karlnapf karlnapf changed the title Classify core developers irc output Application: classify core developers irc output Mar 23, 2017
@ghost

ghost commented Mar 24, 2017

I'm repeatedly encountering internal server errors while opening most of the logs. Any other way I can get these logs?

@karlnapf
Member Author

We are fixing the link at the moment. Check back in a few days.

@minxuancao
Contributor

I'm interested in this project. I'm new to open source and github. I want to confirm what 'log' means. Is it the commit history?

@karlnapf
Member Author

No, it is the IRC chat logs. They are currently offline, but we might put them online again.
In order to build the system, you can use any other chat log on the web. Then, when you are done and we have put our chat logs back online, you can just apply it to ours.

@minxuancao
Contributor

Thanks!

@tingpan
Contributor

tingpan commented Mar 27, 2017

I am interested in this one! I will have a try!

@karlnapf
Member Author

Start with bag of words representations, see e.g. here: https://github.com/karlnapf/machine_learning_course/blob/master/classification.ipynb
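For readers unfamiliar with the term, a bag-of-words representation can be sketched in plain Python as below. This is only an illustration, not the code from the linked notebook and not Shogun's API; the function names `tokenize`, `build_vocabulary`, and `bag_of_words` and the sample messages are made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_vocabulary(messages):
    """Map every distinct token to a column index, in sorted order."""
    tokens = sorted({tok for msg in messages for tok in tokenize(msg)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(message, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(message)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


# Hypothetical chat messages, used only to show the shape of the output.
messages = ["the build is broken", "fix the build please"]
vocab = build_vocabulary(messages)
vectors = [bag_of_words(m, vocab) for m in messages]
```

Each message becomes a fixed-length vector of word counts, so word order is discarded and only word frequencies remain as features.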

@karlnapf
Member Author

Or train a NN ;)

@tboex

tboex commented Mar 29, 2017

Is it also possible for me to work on this one?

@karlnapf
Member Author

Everyone is free to try their thing.

@moizsajid

Are we supposed to label the data ourselves?

@karlnapf
Member Author

karlnapf commented Apr 3, 2017

Chat logs are labelled by construction
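To illustrate what "labelled by construction" means: every chat line already carries the nickname of its author, so the label can be read straight off the line. The sketch below assumes a log format of `[HH:MM] <nick> message`, which is a guess; the actual Shogun logs may be formatted differently, and `LINE_RE` and `parse_log` are hypothetical names.

```python
import re

# Assumed line format: "[12:34] <nick> message text".
# The real log format may differ; adjust the pattern accordingly.
LINE_RE = re.compile(r"^\[(\d{2}:\d{2})\]\s+<([^>]+)>\s+(.*)$")


def parse_log(lines):
    """Return (nick, message) pairs, skipping lines that are not chat messages."""
    pairs = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # e.g. joins, parts, topic changes
        _, nick, message = match.groups()
        # Drop IRC status prefixes such as "@" (operator) or "+" (voice).
        pairs.append((nick.lstrip("@+"), message))
    return pairs


# Made-up sample lines, not taken from the real logs.
sample = [
    "[10:01] <alice> the build is broken again",
    "[10:02] -!- bob has joined #channel",
    "[10:03] <@bob> which compiler?",
]
examples = parse_log(sample)
```

The nickname in each pair is the label and the message text is the raw input for feature extraction.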

@moizsajid

So basically our task boils down to classifying whether a given chat post came from a Shogun Core Developer or not?

@karlnapf
Member Author

karlnapf commented Apr 5, 2017

Could do that. But the task is to classify which of the core devs wrote a given message.
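The difference between the two formulations comes down to how the labels are built. A small sketch, assuming the data is already a list of `(nick, message)` pairs and that a set of core developer nicknames is known; the nicknames and helper names below are invented for illustration.

```python
def binary_labels(pairs, core_devs):
    """Binary task: 1 if the author is a core developer, else 0."""
    return [(msg, 1 if nick in core_devs else 0) for nick, msg in pairs]


def multiclass_labels(pairs, core_devs):
    """Multiclass task: keep only core developers' messages, one class per developer."""
    classes = sorted(core_devs)
    index = {nick: i for i, nick in enumerate(classes)}
    kept = [(msg, index[nick]) for nick, msg in pairs if nick in index]
    return classes, kept


# Hypothetical nicknames and messages.
pairs = [("alice", "merged the patch"), ("guest42", "how do I install?"), ("bob", "tests fail")]
core = {"alice", "bob"}

binary = binary_labels(pairs, core)
classes, multi = multiclass_labels(pairs, core)
```

In the multiclass version, messages from non-core users are dropped here; treating them as one extra "other" class would be another reasonable choice.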

@moizsajid

I have trained a Naive Bayes classifier on the 1,941 chat log files. However, I am getting a very low accuracy (less than 1%). I am searching for the possible bug in my code. https://github.com/moizsajid/shogun-core-developers

@lisitsyn
Member

This was supposed to be done using Shogun ;)

@moizsajid

Found the major bug! I was not taking the log of the prior probabilities. After this update, I am getting an accuracy of 59.85%.
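For context, a multinomial Naive Bayes classifier usually scores each class as log prior plus the sum of log likelihoods; adding a raw prior to log likelihoods mixes two scales and distorts the ranking. The sketch below is a generic textbook version with Laplace smoothing, not the code from the linked repository; all names and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and Laplace-smoothed log likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = len(labels)
    # The prior goes into log space, like the likelihoods.
    log_prior = {c: math.log(n / total) for c, n in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_lik


def predict_nb(tokens, log_prior, log_lik):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(log_lik[c][w] for w in tokens if w in log_lik[c])
    return max(scores, key=scores.get)


# Toy data with hypothetical authors.
docs = [["build", "broken"], ["build", "fix"], ["kernel", "svm"]]
labels = ["alice", "alice", "bob"]
log_prior, log_lik = train_nb(docs, labels)
```

With this toy data, `predict_nb(["build"], log_prior, log_lik)` picks `"alice"` and `predict_nb(["svm"], log_prior, log_lik)` picks `"bob"`.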

@moizsajid

@lisitsyn I will also try to complete it using Shogun just to verify my results. Is there a Bag of Words representation available in Shogun?


6 participants