## Building a Language Detection Service with Machine Learning

### 1. Setting the Research Goal

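- Similar services: Papago, Google Translate
- Overview
    - The language detection part of a translation service is implemented as classification, a form of supervised machine learning
    - Among languages written in the Latin alphabet, the frequency of each letter differs from language to language
- Conditions
    - Languages that do not use the alphabet are out of scope, because they call for a separate method (judging by character codes, e.g. utf-8 or euc-kr encoded text)
    - Strings no longer than a provisional threshold (100 bytes) are excluded; the threshold may change
    - Translation itself is usually handled with a deep learning RNN and is out of scope here; instead, the Papago API is used to provide a comparable feature
    - Once the service is open and data accumulates, the model is retrained and swapped in (languages evolve), and this should run smoothly (strategy). For now, this project only accumulates the data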
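
The length condition above can be sketched as a small guard. This is a minimal sketch, assuming the 100 bytes are measured on the UTF-8 encoding and that a string of exactly 100 bytes is also rejected; `MIN_BYTES` and `is_long_enough` are hypothetical names, not part of the original notebook.

```python
MIN_BYTES = 100  # provisional threshold from the conditions; may change


def is_long_enough(text: str, min_bytes: int = MIN_BYTES) -> bool:
    """Return True if the UTF-8 encoding of text is longer than min_bytes."""
    # Assumption: length is counted in UTF-8 bytes, not in characters.
    return len(text.encode("utf-8")) > min_bytes


print(is_long_enough("too short"))
print(is_long_enough("a" * 101))
```

Counting bytes rather than characters means the same threshold admits fewer characters for text with multi-byte letters.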

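|No|Step|Details|
|:---:|:---|:---|
|1|Setting the research goal|- Web service<br>- Predict which language the user's input text is written in (Latin-alphabet languages only)<br>- Use supervised learning (classification)<br>- Pipeline construction and optimization through hyperparameter tuning are excluded<br>- No quantitative target is set (evaluation is excluded)<br>- Strings no longer than a provisional threshold (100 bytes) are excluded<br>- Check the supporting evidence for the claim in the literature|
|2|Data acquisition/collection|- In practice: collect diverse text from Wikipedia, legal texts, novels, and so on<br>- In this implementation: use the provided data (statutes, scripts, novels, etc.)|
|3|Data preparation/insight/preprocessing|- Remove every character that is not an alphabet letter (preprocessing, regular expressions)<br>- Convert each text into letter occurrence frequencies (character counting, turning the data into numbers)<br>- Split the data into training data (train 50, validation 25) and test data (25), i.e. train:test = 75:25, a commonly used ratio that may change|
|4|Data exploration/insight/visualization|- Verify the claim from the literature<br>- Confirm the proposition that letter frequencies differ across Latin-alphabet languages<br>- Check it with EDA (visualization) using line charts, bar charts, and the like|
|5|Data modeling and model building|- Select an algorithm<br>- Prepare training and test data<br>- Train<br>- Predict<br>- Evaluate performance (review down to the learning method and its subcategories)<br>- Apply an algorithm chain by building a pipeline and look for the best combination of algorithms<br>- Repeat until the research goal is reached|
|6|System integration|- Dump the model (save the trained algorithm to a file)<br>- Build the web service (a simple Flask setup)<br>- Build the service<br>- Add a system for upgrading the model<br>- Log translation requests to create a virtuous cycle that feeds batch learning, online learning, and so on|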
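
The 50/25/25 split from step 3 could look roughly like the following. This is a minimal standard-library sketch; `split_dataset` and its parameters are hypothetical names, and a library helper such as scikit-learn's `train_test_split` might be used instead in practice.

```python
import random


def split_dataset(samples, train=0.5, valid=0.25, seed=0):
    """Shuffle samples and split them into train/validation/test lists."""
    items = list(samples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],  # whatever remains is the test set
    )


train_set, valid_set, test_set = split_dataset(range(20))
print(len(train_set), len(valid_set), len(test_set))  # 10 5 5
```

Because the ratios are parameters, the split can be changed later without touching the rest of the code.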

### 2. Data Acquisition


In [None]:
- In practice: collect diverse text from Wikipedia, legal texts, novels, etc.
    - Libraries: urllib.request, bs4
    - Site: https://{language code}.wikipedia.org/wiki/Main_Page

In [4]:
import urllib.request as req
from bs4 import BeautifulSoup

In [5]:
# Build the target URL from default values (language code, keyword)
target_site = "https://{na_code}.wikipedia.org/wiki/{keyword}".format(na_code='en', keyword='bong')
target_site


'https://en.wikipedia.org/wiki/bong'
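
To collect pages for several languages, the URL template can be wrapped in a function with default arguments. This is a sketch under assumptions: `build_wiki_url` is a hypothetical helper, the language codes are taken from the training file names used later in this notebook, and percent-encoding the keyword is an addition that the original cell does not perform.

```python
from urllib.parse import quote


def build_wiki_url(na_code: str = "en", keyword: str = "bong") -> str:
    """Return the Wikipedia article URL for a language code and keyword."""
    # quote() percent-encodes spaces and non-ASCII characters in the keyword.
    return "https://{na_code}.wikipedia.org/wiki/{keyword}".format(
        na_code=na_code, keyword=quote(keyword)
    )


for code in ("en", "fr", "id", "tl"):
    print(build_wiki_url(code))
```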

In [6]:
# Why this parser: html5lib is chosen for robustness when parsing large amounts of HTML
soup = BeautifulSoup(req.urlopen(target_site), 'html5lib')

In [7]:
# Extract the data
# CSS selector: #mw-content-text p
tmp=soup.select('#mw-content-text p')
len(tmp)

17

In [8]:
# Collect the text of every p element into a list => 17 items in this run
texts = list()
for p in tmp:
    # Append one item
    texts.append(p.text)
    # print(type(p.text), p.text)
len(texts), texts[:2]

(17,
 ['\n',
  'A bong (also water pipe, billy, bing, or moof) is a filtration device generally used for smoking cannabis, tobacco, or other herbal substances.[1]  In the bong shown in the photo, the gas flows from the lower port on the left to the upper port on the right.\n'])

In [9]:
# Merge the texts into a single string
a= list('helloworld')
a

['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']

In [None]:
''.join(a)

In [11]:
str_text = ''.join(texts)
len(str_text)

7419

In [12]:
# Confirm that the collected data is now a single string

str_text = ''.join(texts)
len(str_text),str_text[:100]

(7419,
 '\nA bong (also water pipe, billy, bing, or moof) is a filtration device generally used for smoking ca')

In [13]:
import re

In [19]:
# Match runs of characters that are not ASCII letters
p = re.compile('[^a-zA-Z]+')

In [20]:
tmp = p.sub('',str_text)
tmp

'AbongalsowaterpipebillybingormoofisafiltrationdevicegenerallyusedforsmokingcannabistobaccoorotherherbalsubstancesInthebongshowninthephotothegasflowsfromthelowerportonthelefttotheupperportontherightInconstructionandfunctionabongissimilartoahookahexceptsmallerandespeciallymoreportableAbongmaybeconstructedfromanyairandwatertightvesselbyaddingabowlandstemapparatusorslidewhichguidesairdownwardtobelowwaterlevelwhenceitbubblesupwardbubblerduringuseTogetfreshairintothebongandharvestthelastremainingsmokeaholeknownasthecarburetorcarbchokebinkrushshottykickholeorsimplyholesomewhereonthelowerpartofthebongabovewaterlevelisfirstkeptcoveredduringthesmokingprocessthenopenedtoallowthesmoketobedrawnintotherespiratorysystemOnbongswithoutsuchaholethebowlandorthestemareremovedtoallowairfromtheholethatholdsthestemBongshavebeeninusebytheHmonginLaosandThailandandalloverAfricaforcenturiesOneoftheearliestrecordedusesofthewordintheWestisintheMcFarlandThaiEnglishDictionarypublishedinwhichdescribesoneofthemeaning

In [21]:
tmp = p.sub('',str_text)
tmp.lower()


'abongalsowaterpipebillybingormoofisafiltrationdevicegenerallyusedforsmokingcannabistobaccoorotherherbalsubstancesinthebongshowninthephotothegasflowsfromthelowerportonthelefttotheupperportontherightinconstructionandfunctionabongissimilartoahookahexceptsmallerandespeciallymoreportableabongmaybeconstructedfromanyairandwatertightvesselbyaddingabowlandstemapparatusorslidewhichguidesairdownwardtobelowwaterlevelwhenceitbubblesupwardbubblerduringusetogetfreshairintothebongandharvestthelastremainingsmokeaholeknownasthecarburetorcarbchokebinkrushshottykickholeorsimplyholesomewhereonthelowerpartofthebongabovewaterlevelisfirstkeptcoveredduringthesmokingprocessthenopenedtoallowthesmoketobedrawnintotherespiratorysystemonbongswithoutsuchaholethebowlandorthestemareremovedtoallowairfromtheholethatholdsthestembongshavebeeninusebythehmonginlaosandthailandandalloverafricaforcenturiesoneoftheearliestrecordedusesofthewordinthewestisinthemcfarlandthaienglishdictionarypublishedinwhichdescribesoneofthemeaning
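
The two cleaning steps above (strip non-letters, then lowercase) can be combined into one helper. This is a minimal sketch; `normalize` and `NON_LETTERS` are hypothetical names.

```python
import re

# Runs of anything that is not an ASCII letter
NON_LETTERS = re.compile(r"[^a-zA-Z]+")


def normalize(text: str) -> str:
    """Remove every non-ASCII-letter character and lowercase the rest."""
    return NON_LETTERS.sub("", text).lower()


print(normalize("A bong (also water pipe)[1]\n"))  # abongalsowaterpipe
```

Note that this pattern also drops accented letters such as `é`, so text in languages like French loses those characters before the letters are counted.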

### 3. Data Preparation

In [54]:
import os.path, glob
import re

file_list  = glob.glob('./data/train/*.txt')
file_list

['./data/train\\en-1.txt',
 './data/train\\en-2.txt',
 './data/train\\en-3.txt',
 './data/train\\en-4.txt',
 './data/train\\en-5.txt',
 './data/train\\fr-10.txt',
 './data/train\\fr-6.txt',
 './data/train\\fr-7.txt',
 './data/train\\fr-8.txt',
 './data/train\\fr-9.txt',
 './data/train\\id-11.txt',
 './data/train\\id-12.txt',
 './data/train\\id-13.txt',
 './data/train\\id-14.txt',
 './data/train\\id-15.txt',
 './data/train\\tl-16.txt',
 './data/train\\tl-17.txt',
 './data/train\\tl-18.txt',
 './data/train\\tl-19.txt',
 './data/train\\tl-20.txt']

In [56]:
# The matching files are returned as a list
len(file_list), type(file_list),file_list[:2]

(20, list, ['./data/train\\en-1.txt', './data/train\\en-2.txt'])

In [57]:
fName = file_list[0]
fName

'./data/train\\en-1.txt'
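
Supervised classification needs a label for each file. The file names above appear to start with a language code (`en`, `fr`, `id`, `tl`), so one option is to read the label from the name. This is a sketch under that assumption; `label_from_path` is a hypothetical helper and the `code-number.txt` naming pattern is inferred from the listing, not stated in the notebook.

```python
import os.path
import re


def label_from_path(path: str) -> str:
    """Extract the language code from a name like 'en-1.txt'."""
    # Normalise Windows separators so basename behaves the same on any OS.
    name = os.path.basename(path.replace("\\", "/"))
    match = re.match(r"([a-z]+)-\d+\.txt$", name)
    if match is None:
        raise ValueError("unexpected file name: {}".format(name))
    return match.group(1)


print(label_from_path('./data/train\\en-1.txt'))  # en
```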

In [42]:
with open(fName, "r", encoding='utf-8') as fq:
    row = fq.readlines()
    
row


['\n',
 '\n',
 '\n',
 '\n',
 'The main Henry Ford Museum building houses some of the classrooms for the Henry Ford Academy\n',
 '\n',
 '\n',
 'Henry Ford Academy is the first charter school in the United States to be developed jointly by a global corporation, public education, and a major nonprofit cultural institution. The school is sponsored by the Ford Motor Company, Wayne County Regional Educational Service Agency and The Henry Ford Museum and admits high school students. It is located in Dearborn, Michigan on the campus of the Henry Ford museum. Enrollment is taken from a lottery in the area and totaled 467 in 2010.[1]\n',
 'Freshman meet inside the main museum building in glass walled classrooms, while older students use a converted carousel building and Pullman cars on a siding of the Greenfield Village railroad. Classes are expected to include use of the museum artifacts, a tradition of the original Village Schools. When the Museum was established in 1929, it included a school 

In [71]:
str_row = ' '.join(row)
len(str_row)

6523

In [74]:
# Match runs of characters that are not ASCII letters
p = re.compile('[^a-zA-Z]+')

In [75]:
aa = p.sub('', str_row)
len(aa), aa.lower()

(4595,
 'themainhenryfordmuseumbuildinghousessomeoftheclassroomsforthehenryfordacademyhenryfordacademyisthefirstcharterschoolintheunitedstatestobedevelopedjointlybyaglobalcorporationpubliceducationandamajornonprofitculturalinstitutiontheschoolissponsoredbythefordmotorcompanywaynecountyregionaleducationalserviceagencyandthehenryfordmuseumandadmitshighschoolstudentsitislocatedindearbornmichiganonthecampusofthehenryfordmuseumenrollmentistakenfromalotteryintheareaandtotaledinfreshmanmeetinsidethemainmuseumbuildinginglasswalledclassroomswhileolderstudentsuseaconvertedcarouselbuildingandpullmancarsonasidingofthegreenfieldvillagerailroadclassesareexpectedtoincludeuseofthemuseumartifactsatraditionoftheoriginalvillageschoolswhenthemuseumwasestablishedinitincludedaschoolwhichservedgradeskindergartentocollegetradeschoolagesthelastpartoftheoriginalschoolclosedinthehenryfordlearninginstituteisusingthehenryfordacademymodelforfurthercharterschoolsincludingthepowerhousehighinchicagoandalamedaschoolfor

In [110]:
str_row

'\n \n \n \n The main Henry Ford Museum building houses some of the classrooms for the Henry Ford Academy\n \n \n Henry Ford Academy is the first charter school in the United States to be developed jointly by a global corporation, public education, and a major nonprofit cultural institution. The school is sponsored by the Ford Motor Company, Wayne County Regional Educational Service Agency and The Henry Ford Museum and admits high school students. It is located in Dearborn, Michigan on the campus of the Henry Ford museum. Enrollment is taken from a lottery in the area and totaled 467 in 2010.[1]\n Freshman meet inside the main museum building in glass walled classrooms, while older students use a converted carousel building and Pullman cars on a siding of the Greenfield Village railroad. Classes are expected to include use of the museum artifacts, a tradition of the original Village Schools. When the Museum was established in 1929, it included a school which served grades kindergarten 

#### Processing Multiple Files at Once

In [76]:
import os.path, glob
import re

file_list  = glob.glob('./data/train/*.txt')
file_list

['./data/train\\en-1.txt',
 './data/train\\en-2.txt',
 './data/train\\en-3.txt',
 './data/train\\en-4.txt',
 './data/train\\en-5.txt',
 './data/train\\fr-10.txt',
 './data/train\\fr-6.txt',
 './data/train\\fr-7.txt',
 './data/train\\fr-8.txt',
 './data/train\\fr-9.txt',
 './data/train\\id-11.txt',
 './data/train\\id-12.txt',
 './data/train\\id-13.txt',
 './data/train\\id-14.txt',
 './data/train\\id-15.txt',
 './data/train\\tl-16.txt',
 './data/train\\tl-17.txt',
 './data/train\\tl-18.txt',
 './data/train\\tl-19.txt',
 './data/train\\tl-20.txt']

In [81]:
rowlist = list()
for i in file_list:
    with open(i, "r", encoding='utf-8') as fq:
        rows = fq.readlines()
    rowlist.append(rows)
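
The read loop above, together with the join step that follows it, can be folded into a single helper that reads each file as one string. This is a minimal sketch; `load_texts` is a hypothetical name. Using `read()` instead of `readlines()` plus `' '.join(...)` skips the extra spaces between lines, which makes no difference once non-letters are stripped.

```python
import glob


def load_texts(paths, encoding="utf-8"):
    """Read each file fully and return one string per file."""
    texts = []
    for path in paths:
        with open(path, "r", encoding=encoding) as fq:
            texts.append(fq.read())
    return texts


# Same glob pattern as in the cells above; yields an empty list if the
# directory does not exist.
file_list = glob.glob('./data/train/*.txt')
texts = load_texts(file_list)
print(len(texts))
```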

In [85]:
rowlist

[['\n',
  '\n',
  '\n',
  '\n',
  'The main Henry Ford Museum building houses some of the classrooms for the Henry Ford Academy\n',
  '\n',
  '\n',
  'Henry Ford Academy is the first charter school in the United States to be developed jointly by a global corporation, public education, and a major nonprofit cultural institution. The school is sponsored by the Ford Motor Company, Wayne County Regional Educational Service Agency and The Henry Ford Museum and admits high school students. It is located in Dearborn, Michigan on the campus of the Henry Ford museum. Enrollment is taken from a lottery in the area and totaled 467 in 2010.[1]\n',
  'Freshman meet inside the main museum building in glass walled classrooms, while older students use a converted carousel building and Pullman cars on a siding of the Greenfield Village railroad. Classes are expected to include use of the museum artifacts, a tradition of the original Village Schools. When the Museum was established in 1929, it included 

In [105]:
strlist = list()
for j in rowlist:
    # Join each file's lines into a single string
    str_rows = ' '.join(j)
    strlist.append(str_rows)

In [106]:
strlist

['\n \n \n \n The main Henry Ford Museum building houses some of the classrooms for the Henry Ford Academy\n \n \n Henry Ford Academy is the first charter school in the United States to be developed jointly by a global corporation, public education, and a major nonprofit cultural institution. The school is sponsored by the Ford Motor Company, Wayne County Regional Educational Service Agency and The Henry Ford Museum and admits high school students. It is located in Dearborn, Michigan on the campus of the Henry Ford museum. Enrollment is taken from a lottery in the area and totaled 467 in 2010.[1]\n Freshman meet inside the main museum building in glass walled classrooms, while older students use a converted carousel building and Pullman cars on a siding of the Greenfield Village railroad. Classes are expected to include use of the museum artifacts, a tradition of the original Village Schools. When the Museum was established in 1929, it included a school which served grades kindergarten

In [107]:
# Match runs of characters that are not ASCII letters
p = re.compile('[^a-zA-Z]+')

In [114]:
aa = list()

for rows_a in strlist:
    # Strip everything except ASCII letters from each file's text
    aaa = p.sub('', rows_a)
    aa.append(aaa)

In [116]:
aa

['ThemainHenryFordMuseumbuildinghousessomeoftheclassroomsfortheHenryFordAcademyHenryFordAcademyisthefirstcharterschoolintheUnitedStatestobedevelopedjointlybyaglobalcorporationpubliceducationandamajornonprofitculturalinstitutionTheschoolissponsoredbytheFordMotorCompanyWayneCountyRegionalEducationalServiceAgencyandTheHenryFordMuseumandadmitshighschoolstudentsItislocatedinDearbornMichiganonthecampusoftheHenryFordmuseumEnrollmentistakenfromalotteryintheareaandtotaledinFreshmanmeetinsidethemainmuseumbuildinginglasswalledclassroomswhileolderstudentsuseaconvertedcarouselbuildingandPullmancarsonasidingoftheGreenfieldVillagerailroadClassesareexpectedtoincludeuseofthemuseumartifactsatraditionoftheoriginalVillageSchoolsWhentheMuseumwasestablishedinitincludedaschoolwhichservedgradeskindergartentocollegetradeschoolagesThelastpartoftheoriginalschoolclosedinTheHenryFordLearningInstituteisusingtheHenryFordAcademymodelforfurthercharterschoolsincludingthePowerHouseHighinChicagoandAlamedaSchoolforArtDesi

In [126]:
aaaa = list()
for k in aa:
    # Lowercase each cleaned string
    l = k.lower()
    aaaa.append(l)

In [127]:
aaaa

['themainhenryfordmuseumbuildinghousessomeoftheclassroomsforthehenryfordacademyhenryfordacademyisthefirstcharterschoolintheunitedstatestobedevelopedjointlybyaglobalcorporationpubliceducationandamajornonprofitculturalinstitutiontheschoolissponsoredbythefordmotorcompanywaynecountyregionaleducationalserviceagencyandthehenryfordmuseumandadmitshighschoolstudentsitislocatedindearbornmichiganonthecampusofthehenryfordmuseumenrollmentistakenfromalotteryintheareaandtotaledinfreshmanmeetinsidethemainmuseumbuildinginglasswalledclassroomswhileolderstudentsuseaconvertedcarouselbuildingandpullmancarsonasidingofthegreenfieldvillagerailroadclassesareexpectedtoincludeuseofthemuseumartifactsatraditionoftheoriginalvillageschoolswhenthemuseumwasestablishedinitincludedaschoolwhichservedgradeskindergartentocollegetradeschoolagesthelastpartoftheoriginalschoolclosedinthehenryfordlearninginstituteisusingthehenryfordacademymodelforfurthercharterschoolsincludingthepowerhousehighinchicagoandalamedaschoolforartdesi
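
The next step named in the plan is turning each cleaned string into letter occurrence frequencies. A possible sketch follows; `letter_frequencies` is a hypothetical helper, and dividing by the total letter count (so texts of different lengths are comparable) is an assumption rather than something the notebook specifies.

```python
from collections import Counter
from typing import List
import string


def letter_frequencies(text: str) -> List[float]:
    """Return the relative frequency of a-z in text as a 26-element list."""
    counts = Counter(ch for ch in text if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: return an all-zero vector instead of dividing by 0.
        return [0.0] * 26
    return [counts[ch] / total for ch in string.ascii_lowercase]


freq = letter_frequencies("themainhenryfordmuseum")
print(len(freq), round(sum(freq), 6))  # 26 1.0
```

Applied to every string in the lowercased list, this yields one fixed-length numeric vector per file, which is the form a classifier expects.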

### 4. Data Exploration

### 5. Data Modeling and Model Building

### 6. System Integration