# Statistical Analysis of Dakota

## Corpus
[Transcription from Sisseton-Wahpeton Oyate Dakotah Language Institute (SWODLI)](https://docs.google.com/spreadsheets/d/1ixJatzeU1vDluVwttnm04a3swpodSOnoiGVlHz1QTkY/edit#gid=1036955163)
### Encoding
Data is UTF-8 encoded. 

## Ngram Analysis
### Building a basic n-gram generator

The following code builds a very basic n-gram generator and shows the result of a naive analysis of the data.

In [20]:
import n_gram

n_gram.main()

Start word:  tipi
2 -gram sentence:
tipi heciya bde kte taku owas omaƞg hena taku owas
3 -gram sentence:
tipi hed waniyaka ciƞ iyagezizi num caḣdi uƞkaġapi Frosty iyaye
4 -gram sentence:
tipi kaġapi. heun sistuƞwaƞ waḣpetuƞ dakota magic hed aƞpetu waṡte
5 -gram sentence:
tipi bde kte hed inawajiƞ kte he tuwe asaƞpi suta
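
The source of `n_gram` is not shown here, so the following is only a minimal sketch of how such a generator might work, not the module's actual implementation. The names `build_model` and `generate` are hypothetical, and the toy input is the 2-gram sentence from the output above rather than the full corpus. It assumes `n >= 2`.

```python
import random
from collections import defaultdict

def build_model(tokens, n):
    """Map each (n-1)-word context to the list of words observed after it."""
    model = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].append(tokens[i + n - 1])
    return model

def generate(model, n, start, length=10, seed=0):
    """Generate up to `length` words, beginning with a context that starts with `start`."""
    rng = random.Random(seed)
    # Contexts are (n-1)-tuples; keep those whose first word is the start word.
    candidates = sorted(c for c in model if c[0] == start)
    if not candidates:
        return [start]
    words = list(rng.choice(candidates))
    while len(words) < length:
        followers = model.get(tuple(words[-(n - 1):]))
        if not followers:
            break  # dead end: this context was never followed by anything
        words.append(rng.choice(followers))
    return words

# Toy input (assumption): one generated sentence reused as training text.
tokens = "tipi heciya bde kte taku owas omaƞg hena taku owas".split()
model = build_model(tokens, 2)
sentence = generate(model, 2, "tipi")
print(" ".join(sentence))
```

Storing followers as a list with repeats, rather than as a set, means `rng.choice` samples each continuation in proportion to its observed frequency.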


### Using the NLTK library
The module `nltk_analysis` analyzes the data using the NLTK library.

### Word level analysis
Most of the highest-frequency words appear to belong to closed classes, as expected:
* *he* is proposed to be the Question marker. 
* *kte* is proposed to be the Present Tense marker.
* *de* is proposed to be demonstrative.
* *taku* is translated to "something, what"
* *hena* is proposed to be demonstrative.
* *tuwe* is translated to "who".
* *ka* is proposed to be the definite marker.
* *ṡni* is the negation marker.
* *uŋ* might be the indefinite marker; when used as pronominal clitic, it marks the dual form.

Other high-frequency words are common verbs such as "eat".

Bigram analysis also revealed some common word pairs. The most frequent trigrams, four-grams and five-grams, however, more likely reflect biases of the corpus than properties of the language: the top trigrams include repeated phrases such as *thumpety thump thump*.
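
As a rough illustration of the frequency ranking described above, the sketch below counts word n-grams with the standard library's `Counter` instead of NLTK. The function name `top_ngrams` is hypothetical, and `toy_lines` is a made-up input assembled from words in the output, not actual corpus sentences.

```python
from collections import Counter

def top_ngrams(lines, n, k):
    """Return the k most frequent word n-grams, counted within each line."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        # n-grams never cross a line boundary, so sentences stay separate.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common(k)

# Toy input (assumption): illustrative lines only.
toy_lines = ["aƞpetu de waṡte", "bde kte", "taku yaciƞ he", "bde kte he"]
top_bigram = top_ngrams(toy_lines, 2, 1)
print(top_bigram)
```

With a small corpus, a single text that repeats a phrase can dominate the higher-order counts, which is one way the corpus bias noted above could arise.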

#### Analysis with line breaks: provides information about the position of words and phrases within a sentence
- *he* appears most frequently in sentence-final position.
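
One way to obtain this kind of positional information is to append an end-of-line marker to each line and count which words immediately precede it. The sketch below assumes that approach. The marker `<eos>` and the function `final_word_counts` are hypothetical names, and the input lines are made up for illustration.

```python
from collections import Counter

EOS = "<eos>"  # hypothetical end-of-line marker, not taken from the original code

def final_word_counts(lines):
    """Count how often each word is directly followed by the end-of-line marker."""
    counts = Counter()
    for line in lines:
        tokens = line.split() + [EOS]
        for left, right in zip(tokens, tokens[1:]):
            if right == EOS:
                counts[left] += 1
    return counts

# Toy input (assumption): illustrative lines only.
toy_lines = ["aƞpetu de waṡte", "bde kte", "taku yaciƞ he", "bde kte he"]
final_counts = final_word_counts(toy_lines)
print(final_counts.most_common(3))
```

Treating the line break as an ordinary token means the same bigram machinery used for word pairs also yields position statistics.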

#### 10-fold cross-validation
To avoid overfitting the data, we ran 10-fold cross-validation on the dataset. Specifically, we tested two things: inversions in the frequency rankings and the standard deviation of word probabilities. The results show that the bigram and trigram rankings are very stable.
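
The exact procedure used in `nltk_analysis` is not shown, so the following is a sketch of one plausible reading of the two tests. It assumes that each of the `k` splits leaves out one fold, that probabilities are relative frequencies, and that an inversion is a pair of top-ranked words whose order flips in a split. The names `word_probs` and `cross_validate` are hypothetical, and the input is a made-up toy corpus.

```python
from collections import Counter
from itertools import combinations
from statistics import pstdev

def word_probs(lines):
    """Relative frequency of each word in the given lines."""
    counts = Counter(w for line in lines for w in line.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cross_validate(lines, k=10, top=5):
    """Return per-word std dev of probability and per-split inversion counts."""
    full = word_probs(lines)
    # Rank on the full data; ties are broken alphabetically for determinism.
    top_words = sorted(full, key=lambda w: (-full[w], w))[:top]
    splits = []
    for i in range(k):
        train = [line for j, line in enumerate(lines) if j % k != i]
        splits.append(word_probs(train))
    spread = {w: pstdev([p.get(w, 0.0) for p in splits]) for w in top_words}
    inversions = [
        sum(1 for a, b in combinations(top_words, 2)
            if p.get(a, 0.0) < p.get(b, 0.0))
        for p in splits
    ]
    return spread, inversions

# Toy input (assumption): 20 made-up lines, so every fold holds 2 lines.
toy_lines = ["he he kte", "he kte de"] * 10
spread, inversions = cross_validate(toy_lines, k=10, top=3)
print(spread, inversions)
```

Low spread and zero inversions across splits would correspond to the stability reported above.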

In [3]:
import nltk_analysis

nltk_analysis.test_words()



Ngram analysis on words
['he', 'kte', 'de', 'taku', 'hena', 'tuwe', 'ṡni', 'k̇a', 'aƞpetu', 'ka', 'kta', 'nina', 'kiƞhaƞ', 'uƞ', 'ḳa']
-----------*------------
['bde kte', 'kta he', 'yaciƞ he', 'aƞpetu de', 'duha he', 'waziya aƞpetu', 'wayata he', 'ye kte', 'de kta', 'mni kata']
-----------*------------
['de kta he', 'takoja tataŋka cepa', 'tataŋka cepa opta', 'cepa opta iyaye', 'opta iyaye ake', 'thumpety thump thump', 'waci wokayake teca', 'wisdodye ohnakapi koka', 'wokayake teca waƞ', 'thump thump thumpety']
-----------*------------


### Substring level analysis
#### Morphemes

The most frequent bigrams are meaningful morphemes, notably personal pronouns.
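
Substring counts of this kind can be sketched as character n-grams taken inside each word. The version below is an assumption about the approach, not the code behind `test_substrings`. The names `graphemes` and `substring_counts` are hypothetical. It keeps combining marks attached to their base letter, which is a design choice of this sketch, so that a dotted letter is not split across two substrings.

```python
import unicodedata
from collections import Counter

def graphemes(word):
    """Split a word into letters, keeping combining marks with their base letter."""
    units = []
    for ch in unicodedata.normalize("NFC", word):
        if units and unicodedata.combining(ch):
            units[-1] += ch  # attach the mark to the preceding letter
        else:
            units.append(ch)
    return units

def substring_counts(words, n):
    """Count every run of n consecutive letters inside each word."""
    counts = Counter()
    for word in words:
        units = graphemes(word)
        for i in range(len(units) - n + 1):
            counts["".join(units[i:i + n])] += 1
    return counts

# Toy input (assumption): a few words taken from the output listings.
toy_words = ["wayata", "yapi", "s\u0307ni"]
bigram_counts = substring_counts(toy_words, 2)
print(bigram_counts.most_common(3))
```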

##### Drawbacks 
The method also produces nonsense substrings that do not correspond to any morpheme.
#### Syllables
##### Assumptions 
##### Observations 

- CV is the dominant syllable structure.
- Single-vowel (V) syllables also occur.
- CCV syllables also occur.
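
The observations above can be checked by reducing each syllable to a consonant/vowel skeleton and counting the skeletons. The sketch below makes two assumptions that are not stated in the original: the vowels are `a e i o u`, and `ƞ` or `ŋ` directly after a vowel marks nasalization rather than a separate consonant. The function name `cv_pattern` is hypothetical, and the sample syllables are taken from the syllable frequency output.

```python
import unicodedata
from collections import Counter

VOWELS = set("aeiou")    # assumption: the plain vowel letters
NASAL_MARKS = set("ƞŋ")  # assumption: these nasalize a preceding vowel

def cv_pattern(syllable):
    """Reduce a syllable to a string of C (consonant) and V (vowel) symbols."""
    pattern = []
    for ch in unicodedata.normalize("NFC", syllable.lower()):
        if unicodedata.combining(ch):
            continue  # a diacritic belongs to the previous letter
        if ch in NASAL_MARKS and pattern and pattern[-1] == "V":
            continue  # nasalization is folded into the vowel
        pattern.append("V" if ch in VOWELS else "C")
    return "".join(pattern)

# A sample of syllables from the frequency output.
syllables = ["he", "wa", "ya", "i", "pi", "kta", "yaƞ", "ska", "ṡni"]
patterns = Counter(cv_pattern(s) for s in syllables)
print(patterns.most_common())
```

If the nasal letters were instead treated as coda consonants, *yaƞ* would be counted as CVC, so the resulting distribution depends on that assumption.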

In [2]:
import nltk_analysis

nltk_analysis.test_substrings()



Ngram analysis on morphemes
['ya', 'wa', 'ta', 'ka', 'ak', 'aƞ', 'pi', 'na', 'iy', 'he', 'ap', 'ca', 'ni', 'te', 'ha', 'ṡ', 'aŋ', 'ah', 'ki', 'ic']
-----------*------------
['iya', 'api', 'aka', 'ica', 'aya', 'haƞ', 'yap', 'kte', 'owa', 'tak', 'was', '̇ni', 'iye', 'aṡ', 'awa', 'ena', 'etu', 'aḣ', 'kiy', 'tok']
-----------*------------
['yapi', 'kiya', 'ṡni', 'wica', 'taku', 'petu', 'ciya', 'aṡt', 'ƞhaƞ', 'waṡ', 'waka', 'hena', 'iyay', 'ṡte', 'tawa', 'waci', 'aƞna', 'tuwe', 'heha', 'kapi']
-----------*------------


Frequency test on syllables
['he', 'wa', 'ya', 'i', 'pi', 'ta', 'pe', 'a', 'na', 'ka', 'ni', 'ca', 'wo', 'ke', 'pa', 'zi', 'hi', 'tu', 'ma', 'ki', 'o', 'kta', 'yaƞ', 'ci', 'ṡ', 'ska', 'ju', 'ku', 'wi', 'ṡni']
