SB-CH Corpus

Introduction

The SpinningBytes Swiss German Corpus (SB-CH) is a corpus of Swiss German sentences, along with sentiment annotations for a subset of those sentences.

It contains 165'916 sentences, roughly 70% of which are in Swiss German. Of these sentences, 2799 are annotated with sentiment.

The sentiment annotations use 5 categories:

UN: Sentences that don't make sense, are gibberish or aren't Swiss German
UNSURE: Sentences with mixed or ambiguous sentiment
NEUTRAL: Sentences which are neither positive nor negative
NEGATIVE: Sentences with negative sentiment
POSITIVE: Sentences with positive sentiment

Files

chatmania.csv

A CSV file containing sentences originating from chatmania. The file consists of the following columns:

sentence_id: the unique id of the sentence (int)
sentence_text: the text of the sentence, in quotes (")

facebook.csv

A CSV file containing information about sentences in Facebook posts. Since the content of Facebook posts cannot be shared freely, this file contains the information needed to recreate the sentences in the corpus. To this end, the file contains the unique ID of each Facebook post, along with a sentence number. The sentence number (starting at 0) identifies which sentence the entry refers to when the comment is split with the sent_tokenize function from the nltk.tokenize module of NLTK (version 3.2.2).

Concretely, for a Facebook comment body comment_text and a sentence_number of 1, this would mean:

from nltk.tokenize import sent_tokenize

comment_text = "this is the text of a facebook comment. You need to fetch this from facebook"
split_sentences = sent_tokenize(comment_text, language='german')
target_sentence = split_sentences[1]

The columns of the file are as follows:

comment_id: the id of the Facebook comment
status_id: the id of the status this comment was posted on
parent_id: the id of a parent post for this post, if it exists. -1 if there is no parent.
sentence_number: the consecutive sentence number when the comment is tokenized with the sent_tokenize method of NLTK, starting at 0.
md5_hash: the md5-hash of the sentence to verify a correct download
sentence_id: the unique id of the sentence
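
The md5_hash column can be used to check that a re-fetched comment still yields the sentence that was annotated. The following is a minimal sketch of that check, not the logic of get_fb_comments.py. It assumes that the hash is the hexadecimal MD5 digest of the UTF-8 encoded sentence without further normalisation, which this README does not specify. The helper names md5_hex and pick_and_verify are hypothetical.

```python
import hashlib


def md5_hex(sentence):
    """Return the hexadecimal MD5 digest of a sentence.

    Assumption: the sentence is hashed as UTF-8 bytes, unmodified.
    """
    return hashlib.md5(sentence.encode("utf-8")).hexdigest()


def pick_and_verify(sentences, sentence_number, expected_md5):
    """Return the sentence at sentence_number if its hash matches, else None."""
    if not 0 <= sentence_number < len(sentences):
        # The comment may have been edited, or was tokenized differently.
        return None
    candidate = sentences[sentence_number]
    if md5_hex(candidate) == expected_md5.lower():
        return candidate
    return None


# Illustrative data only. Real sentences would come from
# sent_tokenize(comment_text, language='german').
sentences = ["Das isch en Satz.", "Das isch no eine."]
expected = md5_hex(sentences[1])  # stands in for the md5_hash column
print(pick_and_verify(sentences, 1, expected))
```

If the check returns None for many rows, the encoding or whitespace assumption above is the first thing to revisit.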

Fetching Facebook comments

A sample script to fetch the Facebook sentences is provided in get_fb_comments.py.

Note that the following fields need to be set at the top of the file:

app_id: the Facebook application id
app_secret: the Facebook application secret
file_id: the path to the source file (facebook.csv)
result_file: the path to the result file

Alternatively, you can also set access_token in the file directly to an access token.

The script was written for Python 3.6.

The script can be executed as:

$ python get_fb_comments.py
Scraping facebook.csv Comments: 2018-02-05 14:45:45.636237
[...]
Done!
56945 Comments Processed in 648221.15

Facebook apps can be created according to this guide.

This script is based on Facebook Page Post Scraper.

noah.csv

A CSV file mapping NOAH corpus sentences to SB-CH sentiment annotations.

The columns of the file are as follows:

document_id: the name of the source XML file of the sentence in the NOAH corpus
article_id: the id of the <article> tag of the sentence
s_id: the id of the <s> tag of the sentence
md5_hash: the md5 hash of the sentence
sentence_id: the SB-CH sentence id

sms4science.csv

A CSV file mapping sms4science corpus sentences to SB-CH sentiment annotations.

The columns of the file are as follows:

sms_id: the sms4science SMS id
sentence_number: the number of the sentence when the SMS is split with sent_tokenize()
md5_hash: the md5 hash of the sentence
sentence_id: the SB-CH sentence id
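
Because the mapping files and sentiment.csv share the sentence_id column, annotations can be attached to mapping rows with a simple join. The sketch below illustrates this with made-up sample rows. It assumes that the files are comma-separated and start with a header row naming the columns, which this README does not state. The function name join_on_sentence_id is hypothetical.

```python
import csv
import io

# Made-up sample rows for illustration; real data comes from the CSV files.
SMS_CSV = (
    "sms_id,sentence_number,md5_hash,sentence_id\n"
    "17,0,abc,1001\n"
    "17,1,def,1002\n"
)
SENTIMENT_CSV = (
    "sentence_id,un,unsure,neut,neg,pos\n"
    "1002,0,0,1,0,2\n"
)


def join_on_sentence_id(mapping_file, sentiment_file):
    """Attach annotation counts to each mapping row that has them."""
    sentiment = {row["sentence_id"]: row for row in csv.DictReader(sentiment_file)}
    joined = []
    for row in csv.DictReader(mapping_file):
        annotation = sentiment.get(row["sentence_id"])
        if annotation is not None:  # only a subset of sentences is annotated
            joined.append({**row, **annotation})
    return joined


rows = join_on_sentence_id(io.StringIO(SMS_CSV), io.StringIO(SENTIMENT_CSV))
print(rows)
```

With real data, the two io.StringIO objects would be replaced by open file handles for the mapping file and sentiment.csv.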

sentiment.csv

A CSV file containing the sentiment annotations for a subset of the sentences.

The columns are as follows:

sentence_id: the unique id of the sentence
un: the number of times this sentence was annotated with the UN label (Gibberish/Not Swiss-German)
unsure: the number of times this sentence was annotated with the UNSURE label (Mixed/ambiguous sentiment)
neut: the number of times this sentence was annotated with the NEUTRAL label (Neither positive nor negative sentiment)
neg: the number of times this sentence was annotated with the NEGATIVE label (Negative sentiment)
pos: the number of times this sentence was annotated with the POSITIVE label (Positive sentiment)
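
Since each row stores vote counts rather than a single label, users who need one label per sentence have to aggregate the counts themselves. The sketch below shows one possible rule, a strict majority vote that returns no label on ties. This rule is an assumption made for illustration, not a procedure prescribed by the corpus. The names LABEL_COLUMNS and majority_label are hypothetical.

```python
# Maps the count columns of sentiment.csv to the category names.
LABEL_COLUMNS = {
    "un": "UN",
    "unsure": "UNSURE",
    "neut": "NEUTRAL",
    "neg": "NEGATIVE",
    "pos": "POSITIVE",
}


def majority_label(row):
    """Return the label with the most votes, or None if the top count is tied or zero."""
    counts = {label: int(row[column]) for column, label in LABEL_COLUMNS.items()}
    best = max(counts.values())
    winners = [label for label, count in counts.items() if count == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


# Illustrative row, shaped like the output of csv.DictReader (values are strings).
example_row = {"sentence_id": "1", "un": "0", "unsure": "1", "neut": "0", "neg": "3", "pos": "1"}
print(majority_label(example_row))
```

Other rules, such as discarding sentences with any UN vote, may suit a given task better.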

Licence

See our homepage for licence information.

Reference

See our homepage for referencing information.

Remarks

This corpus is provided as is. It was cleaned up on a best-effort basis, but because Swiss German is a low-resource language, automated cleanup is difficult, and roughly 30% of the sentences in the corpus are still not Swiss German. The annotations were done by 5 different annotators.

Acknowledgements

The sentiment-annotated text contains elements from the following two corpora. They are referenced by ID; to obtain the full text, please access the original corpora directly.

Stark, Elisabeth; Ueberwasser, Simone; Ruef, Beni (2009-2015). Swiss SMS Corpus. University of Zurich. www.sms4science.ch
Nora Hollenstein and Noëmi Aepli. "Compilation of a Swiss German Dialect Corpus and its Application to PoS Tagging." COLING 2014 (2014): 85 http://kitt.cl.uzh.ch/kitt/noah/corpus

We thank the creators of the SMS4Science and NOAH corpora for their work.

Contact

SpinningBytes AG can be contacted at info@spinningbytes.com
