<a href="https://colab.research.google.com/github/sagorbrur/codeswitch/blob/master/notebook/codeswitch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# codeswitch
CodeSwitch is an NLP tool for language identification, POS tagging, named entity recognition, and sentiment analysis of code-mixed data.

## Installation

In [2]:
!pip install codeswitch

In [3]:
import torch
torch.__version__

'1.6.0+cu101'

## Language Identification
You can identify the language of each token in the following code-mixed language pairs:

* Spanish-English (spa-eng)
* Hindi-English (hin-eng)
* Nepali-English (nep-eng)

Here is an example with Hindi-English mixed data:

In [6]:
from codeswitch.codeswitch import LanguageIdentification
lid = LanguageIdentification('hin-eng')
# for Spanish-English use 'spa-eng'
# for Nepali-English use 'nep-eng'
text = "Khaike paan banaras wala old and new mix."  # your code-mixed sentence
results = lid.identify(text)
for result in results:
  print(result['word']+"\t"+result['entity'])

Downloading pretrained model. It will take time according to model size and your internet speed


Some weights of the model checkpoint at sagorsarker/codeswitch-hineng-lid-lince were not used when initializing BertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model Download Completed!
[CLS]	other
K	hin
##hai	hin
##ke	hin
pa	hin
##an	hin
bana	hin
##ras	hin
wala	hin
old	en
and	en
new	en
mix	en
.	other
[SEP]	other
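The output above is at the WordPiece level: words are split into pieces, continuation pieces carry a `##` prefix, and BERT's `[CLS]` and `[SEP]` markers are included. A minimal sketch of how the pieces could be joined back into whole words is shown below. The helper name `merge_wordpieces` is hypothetical (it is not part of codeswitch), and the sketch assumes each result is a dict with the `word` and `entity` keys used in the cell above. Giving a merged word the label of its first piece is a simplifying choice, not something the library prescribes.

```python
def merge_wordpieces(results, special_tokens=("[CLS]", "[SEP]")):
    """Join '##' continuation pieces onto the preceding piece.

    Hypothetical helper: each merged word keeps the label of its first piece.
    """
    merged = []
    for item in results:
        word, label = item["word"], item["entity"]
        if word in special_tokens:
            continue  # drop BERT's sentence-boundary markers
        if word.startswith("##") and merged:
            # continuation piece: append its text to the previous word
            merged[-1]["word"] += word[2:]
        else:
            merged.append({"word": word, "entity": label})
    return merged


# A few entries copied from the output above, written as a list of dicts
sample = [
    {"word": "[CLS]", "entity": "other"},
    {"word": "K", "entity": "hin"},
    {"word": "##hai", "entity": "hin"},
    {"word": "##ke", "entity": "hin"},
    {"word": "pa", "entity": "hin"},
    {"word": "##an", "entity": "hin"},
    {"word": "old", "entity": "en"},
    {"word": "[SEP]", "entity": "other"},
]

for item in merge_wordpieces(sample):
    print(item["word"] + "\t" + item["entity"])
```

In practice you would pass `results` from `lid.identify(text)` instead of the hand-written `sample`.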


## POS Tagging
You can get POS tags for the following code-mixed language pairs:

* Spanish-English (spa-eng)
* Hindi-English (hin-eng)

Here is an example of POS tagging Spanish-English mixed data:

In [9]:
from codeswitch.codeswitch import POS
pos = POS('spa-eng')
# for hindi-english use 'hin-eng'
text = "no pero yo re... tengo todavía en la casa un sobre así." # your mixed sentence 
results = pos.tag(text)
for result in results:
  print(result['word']+"\t"+result['entity'])


Downloading pretrained model. It will take time according to model size and your internet speed


Some weights of the model checkpoint at sagorsarker/codeswitch-spaeng-pos-lince were not used when initializing BertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model Download Completed!
[CLS]	PUNCT
no	INTJ
pero	CONJ
yo	PRON
re	VERB
.	PUNCT
.	PUNCT
.	PUNCT
ten	VERB
##go	VERB
todavía	ADV
en	ADP
la	DET
casa	NOUN
un	DET
sobre	ADJ
así	ADV
.	PUNCT
[SEP]	PUNCT
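Because the tagger labels every WordPiece, counting tags directly from the output above would count `tengo` twice (`ten` and `##go`) and would include `[CLS]` and `[SEP]`. The sketch below shows one way to tally tags per word instead. `tag_counts` is a hypothetical helper, not a codeswitch function, and it assumes the same `word` / `entity` dict layout as the cell above.

```python
from collections import Counter


def tag_counts(results, special_tokens=("[CLS]", "[SEP]")):
    """Count tags once per word, skipping special tokens and '##' pieces."""
    counts = Counter()
    for item in results:
        word = item["word"]
        if word in special_tokens or word.startswith("##"):
            continue  # not the first piece of a real word
        counts[item["entity"]] += 1
    return counts


# A few entries copied from the output above, written as a list of dicts
sample = [
    {"word": "[CLS]", "entity": "PUNCT"},
    {"word": "no", "entity": "INTJ"},
    {"word": "pero", "entity": "CONJ"},
    {"word": "yo", "entity": "PRON"},
    {"word": "ten", "entity": "VERB"},
    {"word": "##go", "entity": "VERB"},
    {"word": "casa", "entity": "NOUN"},
    {"word": "[SEP]", "entity": "PUNCT"},
]

for tag, count in sorted(tag_counts(sample).items()):
    print(tag + "\t" + str(count))
```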


## NER Tagging
You can get named entity tags for the following code-mixed language pairs:

* Spanish-English (spa-eng)
* Hindi-English (hin-eng)

Here is an example of tagging named entities in Spanish-English mixed data:

In [10]:
from codeswitch.codeswitch import NER
ner = NER('spa-eng')
# for hindi-english use 'hin-eng'
text = "Este sábado nuestras alumnas en Imagen Modeling by La Gatita reciben la visita de Monic Abbad, joven..." # your mixed sentence 
results = ner.tag(text)
for result in results:
  print(result['word']+"\t"+result['entity'])


Downloading pretrained model. It will take time according to model size and your internet speed


Some weights of the model checkpoint at sagorsarker/codeswitch-spaeng-ner-lince were not used when initializing BertForTokenClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Model Download Completed!
s	B-TIME
##ába	B-TIME
La	B-PER
Ga	I-PER
##tit	I-PER
##a	I-PER
Mon	B-PER
##ic	B-PER
Ab	I-PER
##bad	I-PER
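The tags above follow a B-/I- scheme over WordPieces, so one entity such as `La Gatita` is spread across several rows. The sketch below shows one way to collapse those rows into entity spans. `group_entities` is a hypothetical helper, not a codeswitch function. It assumes the `word` / `entity` dict layout from the cell above, and it treats consecutive rows as adjacent in the sentence, which may not hold in general because untagged tokens are absent from the output.

```python
def group_entities(results):
    """Collapse B-/I- tagged WordPieces into (text, label) entity spans."""
    entities = []
    for item in results:
        word, tag = item["word"], item["entity"]
        prefix, _, label = tag.partition("-")  # "B-PER" -> ("B", "-", "PER")
        if word.startswith("##") and entities:
            # continuation piece of the current word
            entities[-1]["text"] += word[2:]
        elif prefix == "I" and entities and entities[-1]["label"] == label:
            # next word of the current entity
            entities[-1]["text"] += " " + word
        else:
            # a B- tag (or a stray I- tag) opens a new entity
            entities.append({"text": word, "label": label})
    return entities


# The rows from the output above, written as a list of dicts
sample = [
    {"word": "s", "entity": "B-TIME"},
    {"word": "##ába", "entity": "B-TIME"},
    {"word": "La", "entity": "B-PER"},
    {"word": "Ga", "entity": "I-PER"},
    {"word": "##tit", "entity": "I-PER"},
    {"word": "##a", "entity": "I-PER"},
    {"word": "Mon", "entity": "B-PER"},
    {"word": "##ic", "entity": "B-PER"},
    {"word": "Ab", "entity": "I-PER"},
    {"word": "##bad", "entity": "I-PER"},
]

for entity in group_entities(sample):
    print(entity["text"] + "\t" + entity["label"])
```

The first span comes out as `sába` rather than `sábado` because the output above contains only those two pieces of the word.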


## Sentiment Analysis
You can analyze the sentiment of text in the following code-mixed language pair:

* Spanish-English (spa-eng)

Here is a Spanish-English example:


In [12]:
from codeswitch.codeswitch import SentimentAnalysis
sa = SentimentAnalysis('spa-eng')
sentence = "El perro le ladraba a La Gatita .. .. lol #teamlagatita en las playas de Key Biscayne este Memorial day"
result = sa.analyze(sentence)
if result[0]['label']=='LABEL_1':
  print("Positive"+"\t"+str(result[0]['score']))
else:
  print("Negative"+"\t"+str(result[0]['score']))

Some weights of the model checkpoint at sagorsarker/codeswitch-spaeng-sentiment-analysis-lince were not used when initializing BertForSequenceClassification: ['bert.embeddings.position_ids']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Positive	0.9587041735649109
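If you call the analyzer on many sentences, the label check can be wrapped in a small function. The sketch below mirrors the cell above: it treats `LABEL_1` as positive and any other label as negative. `describe_sentiment` is a hypothetical helper, not part of codeswitch, and the label mapping is taken only from the example above, so check it against the model before relying on it.

```python
def describe_sentiment(result, positive_label="LABEL_1"):
    """Return (name, score) for the top prediction.

    Assumes `result` is a list whose first item has 'label' and 'score' keys,
    as in the cell above. The label mapping is an assumption from that cell.
    """
    top = result[0]
    name = "Positive" if top["label"] == positive_label else "Negative"
    return name, top["score"]


# The prediction from the output above, written as the list the cell indexes into
sample = [{"label": "LABEL_1", "score": 0.9587041735649109}]

name, score = describe_sentiment(sample)
print(name + "\t" + str(score))
```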
