# Language Detection using spacy
<li> our main goal is to keep english tweets collected from twitter about Devoxx France event</li>
<li> the tweets in our dataset are expressed in many languages such as french, english... </li>
<li> Here we are using per trained [spaCy models](https://spacy.io/) to detect lenguage of the tweets.</li>

<a href="#a">1. Needed Libraries</a><br>
<a href="#b">2. Dataset</a><br>
<a href="#c">3. Implementation with langdetect</a><br>
<a href="#d">4. Implementation with spacy</a><br>
<a href="#e">5. Conclusion</a><br>

# <a id="a">1. Needed libraries</a>

In [1]:
!pip install spacy_cld

Collecting spacy_cld
  Downloading https://files.pythonhosted.org/packages/e3/3b/f5344007259b5beb0a8e0d7b9e6b0d2c5c4dcfe674bc94b7497bcc201ee0/spacy_cld-0.1.0.tar.gz
Collecting pycld2>=0.31 (from spacy_cld)
[?25l  Downloading https://files.pythonhosted.org/packages/21/77/8525fe5f147bf2819c7c9942c717c4a79b83f8003da1a3847759fb560909/pycld2-0.31.tar.gz (14.3MB)
[K     |████████████████████████████████| 14.3MB 3.4MB/s 
Building wheels for collected packages: spacy-cld, pycld2
  Building wheel for spacy-cld (setup.py) ... [?25l- \ done
[?25h  Created wheel for spacy-cld: filename=spacy_cld-0.1.0-cp36-none-any.whl size=4065 sha256=0aaf2cf0ad1d7b249f2607fc241c8fa78c214f9ed7d156efd3dccf3d96fa0030
  Stored in directory: /tmp/.cache/pip/wheels/7e/a6/a5/604befa6807cc78a6852be9e933c080362b2498fca796cd34e
  Building wheel for pycld2 (setup.py) ... [?25l- \ | / - \ | / - \ | / - \ | / - \ | / - \ done
[?25h  Created wheel for pycld2: fil

In [2]:
import numpy as np 
import pandas as pd
import re
import spacy
from spacy_cld import LanguageDetector


import langdetect
import langid

import os
print(os.listdir("../input"))

['data.csv']


In [3]:
# function for data cleaning..
def remove_xml(text):
    return re.sub(r'<[^<]+?>', '', text)

def remove_newlines(text):
    return text.replace('\n', ' ') 
    

def remove_manyspaces(text):
    return re.sub(r'\s+', ' ', text)

def clean_text(text):
    text = remove_xml(text)
    text = remove_newlines(text)
    text = remove_manyspaces(text)
    return text

# <a id="b">2. Dataset</a>

In [4]:
df = pd.read_csv('../input/data.csv')
df.head()

Unnamed: 0,tweets
0,RT @Alexandre_Roman: #PKS 1.4 is out 🥳\r\nSo m...
1,"RT @AmelieBenoit33: Hello public de #DevoxxFr,..."
2,#PKS 1.4 is out 🥳\r\nSo many features:\r\n - #...
3,cc @val_deleplace it is the article we were ta...
4,RT @kaastore: 😁 un sourire = un cadeau 🎁\r\nKa...


In [5]:
#Loading dataset  in tweets.
tweets    = df['tweets']

Cleaning the Data in the dataset by removing XML, extra spaces, and newlines.

In [6]:
for line in tweets:
    line=clean_text(line)
df.head()

Unnamed: 0,tweets
0,RT @Alexandre_Roman: #PKS 1.4 is out 🥳\r\nSo m...
1,"RT @AmelieBenoit33: Hello public de #DevoxxFr,..."
2,#PKS 1.4 is out 🥳\r\nSo many features:\r\n - #...
3,cc @val_deleplace it is the article we were ta...
4,RT @kaastore: 😁 un sourire = un cadeau 🎁\r\nKa...


Here is the view of samll portion of cleaned dataset taken for language detection.
#### the code below is needed in the next sections :

# <a id="c">3. Implementation with langdetect </a>

In [7]:
"""
result = str(result[0])[:2] : keeping the most dominant language wich is situated 
in the 1st index, and we store the first 2 characters
"""

languages_langdetect = []

# the try except blook because there is some tweets contain links
for line in tweets:
    try:
        result = langdetect.detect_langs(line)
        result = str(result[0])[:2]
    except:
        result = 'unknown'
    
    finally:
        languages_langdetect.append(result)

# <a id="d">4. Implementation with spacy </a>

In [8]:
nlp = spacy.load('en')
language_detector = LanguageDetector()
nlp.add_pipe(language_detector)

In [9]:
"""
doc._.languages returns : list of str
like : ['fr'] -> french
       ['en'] -> english
       [] -> empty
       ['fr','en'] -> french (the most dominant in a tweet) and english (least dominant)
"""

tweets          = df['tweets']
languages_spacy = []

for e in tweets:
    doc = nlp(e)
    # cheking if the doc._.languages is not empty
    # then appending the first detected language in a list
    if(doc._.languages):
        languages_spacy.append(doc._.languages[0])
    # if it is empty, we append the list by unknown
    else:
        languages_spacy.append('unknown')

### Adding a column in the dataframe containing the language of the tweet

In [10]:
df['languages_spacy'] = languages_spacy
df['languages_langdetect'] = languages_langdetect

* **tweets :** denotees the tweets of the dataset
* **languages_spacy :** It shows the sentence which were found to be in english (en) by spacy model.
* **languages_langdetect :** It shows the language of the sentence.

In [11]:
df.head()

Unnamed: 0,tweets,languages_spacy,languages_langdetect
0,RT @Alexandre_Roman: #PKS 1.4 is out 🥳\r\nSo m...,en,en
1,"RT @AmelieBenoit33: Hello public de #DevoxxFr,...",unknown,fr
2,#PKS 1.4 is out 🥳\r\nSo many features:\r\n - #...,en,en
3,cc @val_deleplace it is the article we were ta...,en,en
4,RT @kaastore: 😁 un sourire = un cadeau 🎁\r\nKa...,unknown,fr


# <a id="e">5. Conclusion</a>
<li>spacy returns 1582 en tweets </li>
<li>langdetect returns 1018 en tweets</li>

In [12]:
df['languages_spacy'].value_counts()

fr         1813
en         1582
unknown     683
mfe           1
da            1
crs           1
lb            1
ca            1
sa            1
gl            1
Name: languages_spacy, dtype: int64

In [13]:
df['languages_langdetect'].value_counts()

fr         2701
en         1020
nl          117
ca           71
ro           31
da           22
de           22
it           21
af           19
es           10
sl            9
no            8
sv            7
tr            7
so            4
pt            4
cs            2
sk            2
fi            2
pl            1
unknown       1
id            1
hr            1
sw            1
et            1
Name: languages_langdetect, dtype: int64

In [14]:
df.to_csv('Detected_Languages.csv',index=False)

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
