# Instagram Data Clean
- Data cleaning (remove special characters, remove duplicates)
- Add a target-flag column (`대상여부`)

### Inspecting the content column

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df = pd.read_csv('./data/koreagram_2020.csv', encoding='utf-8-sig')

In [3]:
df['content']

0       Bts' spot! ?? Aquela que fez quest?o de sentar...
1       Happy new year!????It's the new year in Korea!...
2       #사진산책 #happynewyear.새해에는 행복 가득하길!.새해 복 많이 받으세요...
3       ??happy  new year ???2021年も素敵なソイングクと??????????...
4       Feliz a?o nuevo a todos!!! Espero que se cumpl...
                              ...                        
4087    Nouvelle ann?e, nouveau pays, me voici en Cor?...
4088    Lights at Cheonggyecheon Stream? 。。。。。。。。#Cheo...
4089    In the Seoul Sky! ???Views from the tallest bu...
4090    Hello 2020 ??..... #seoul #seoulkora #seoultra...
4091    한국 ????~#laterspamming#sorrynotsorry#seoullove...
Name: content, Length: 4092, dtype: object

### Data cleaning

In [4]:
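# Use a regular expression to replace every character with a space,
# except ASCII letters, digits, Hangul syllables, and '#'
df['content'] = df['content'].str.replace(pat=r'[^A-Za-z0-9가-힣#]', repl=' ', regex=True)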
df['content']

0       Bts  spot     Aquela que fez quest o de sentar...
1       Happy new year     It s the new year in Korea ...
2       #사진산책 #happynewyear 새해에는 행복 가득하길  새해 복 많이 받으세요...
3         happy  new year    2021                     ...
4       Feliz a o nuevo a todos    Espero que se cumpl...
                              ...                        
4087    Nouvelle ann e  nouveau pays  me voici en Cor ...
4088    Lights at Cheonggyecheon Stream          #Cheo...
4089    In the Seoul Sky     Views from the tallest bu...
4090    Hello 2020         #seoul #seoulkora #seoultra...
4091    한국      #laterspamming#sorrynotsorry#seoullove...
Name: content, Length: 4092, dtype: object
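
To make the effect of the pattern concrete, here is a minimal sketch that applies the same character class with `re.sub` to a few made-up strings. The sample strings and the `clean` helper are illustrative assumptions, not part of the notebook or the dataset. Every character outside ASCII letters, digits, Hangul syllables, and `#` becomes a space, so accented Latin letters and Japanese text are removed as well, which is consistent with fragments such as `quest o` and `ann e` in the output above.

```python
import re

# Same character class as the cell above: keep ASCII letters, digits,
# Hangul syllables, and '#'; every other character becomes one space.
PATTERN = re.compile(r'[^A-Za-z0-9가-힣#]')

def clean(text: str) -> str:
    """Hypothetical helper mirroring the str.replace call above."""
    return PATTERN.sub(' ', text)

samples = [
    "Happy new year! #seoul",
    "새해 복 많이 받으세요. #사진산책",
    "Nouvelle ann\u00e9e",   # accented 'e' is not in the kept set
    "2021年も素敵な",         # Japanese characters are not in the kept set
]
cleaned = [clean(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

If accented letters or other scripts should survive, the character class would need to be widened; the notebook as written keeps only the ranges listed.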

In [5]:
# Remove rows with duplicate content and renumber the index
df = df.drop_duplicates(subset=['content']).reset_index(drop=True)
df

| Unnamed: 0 | content | date | like | place | tags |
| --- | --- | --- | --- | --- | --- |
| 0 | Bts spot Aquela que fez quest o de sentar... | 2020-12-31 | 60 | The Min's - 더 민스 | [] |
| 1 | Happy new year It s the new year in Korea ... | 2020-12-31 | 840 |  | [] |
| 2 | #사진산책 #happynewyear 새해에는 행복 가득하길 새해 복 많이 받으세요... | 2020-12-31 | 124 | 선유도공원 | ['#사진산책', '#happynewyear.새해에는', '#사진에감성을더하다', ... |
| 3 | happy new year 2021 ... | 2020-12-31 | 80 |  | ['#ソインク?ク', '#????????', '#徐仁?', '#seoinguk', ... |
| 4 | Feliz a o nuevo a todos Espero que se cumpl... | 2020-12-31 | 41 | 분당구 | ['#corea', '#coreano', '#coreadelsur', '#a?onu... |
| ... | ... | ... | ... | ... | ... |
| 3582 | Nouvelle ann e nouveau pays me voici en Cor ... | 2020-01-01 | 492 | 서울 | ['#seoul', '#coreedusud', '#southkorea', '#gye... |
| 3583 | Lights at Cheonggyecheon Stream #Cheo... | 2020-01-01 | 30 | 청계천 서울 빛초롱 축제 | ['#CheonggyecheonStream', '#Seoul', '#Korea', ... |
| 3584 | In the Seoul Sky Views from the tallest bu... | 2020-01-01 | 56 | 서울스카이 Seoul Sky, Lotte World Tower, Korea | ['#TravelBlogger', '#Explore', '#Korea', '#Kor... |
| 3585 | Hello 2020 #seoul #seoulkora #seoultra... | 2020-01-01 | 24 | 대한민국 | ['#seoul', '#seoulkora', '#seoultravel', '#kor... |


In [6]:
# Strip leading whitespace
df['content'] = df['content'].str.lstrip()
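
In the cells above, duplicates are dropped before leading whitespace is stripped, so two posts that differ only in leading spaces stay as separate rows. The sketch below uses a made-up three-row frame (not the real data) to show how the order of the two steps can change the row count. Whether this affects the actual dataset cannot be told from the notebook.

```python
import pandas as pd

# Made-up data: the first two rows differ only in leading whitespace.
toy = pd.DataFrame({'content': ['  happy new year', 'happy new year', '#seoul']})

# Order used in the notebook: drop duplicates first, then strip.
dedup_then_strip = toy.drop_duplicates(subset=['content']).reset_index(drop=True)
dedup_then_strip['content'] = dedup_then_strip['content'].str.lstrip()

# Alternative order: strip first, then drop duplicates.
strip_then_dedup = toy.assign(content=toy['content'].str.lstrip())
strip_then_dedup = strip_then_dedup.drop_duplicates(subset=['content']).reset_index(drop=True)

print(len(dedup_then_strip))   # rows kept with the notebook's order
print(len(strip_then_dedup))   # rows kept with the alternative order
```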

### Adding the target-flag column (`대상여부`)

In [7]:
df['대상여부'] = 'Y'
df

| Unnamed: 0 | content | date | like | place | tags | 대상여부 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Bts spot Aquela que fez quest o de sentar... | 2020-12-31 | 60 | The Min's - 더 민스 | [] | Y |
| 1 | Happy new year It s the new year in Korea ... | 2020-12-31 | 840 |  | [] | Y |
| 2 | #사진산책 #happynewyear 새해에는 행복 가득하길 새해 복 많이 받으세요... | 2020-12-31 | 124 | 선유도공원 | ['#사진산책', '#happynewyear.새해에는', '#사진에감성을더하다', ... | Y |
| 3 | happy new year 2021 #... | 2020-12-31 | 80 |  | ['#ソインク?ク', '#????????', '#徐仁?', '#seoinguk', ... | Y |
| 4 | Feliz a o nuevo a todos Espero que se cumpl... | 2020-12-31 | 41 | 분당구 | ['#corea', '#coreano', '#coreadelsur', '#a?onu... | Y |
| ... | ... | ... | ... | ... | ... | ... |
| 3582 | Nouvelle ann e nouveau pays me voici en Cor ... | 2020-01-01 | 492 | 서울 | ['#seoul', '#coreedusud', '#southkorea', '#gye... | Y |
| 3583 | Lights at Cheonggyecheon Stream #Cheo... | 2020-01-01 | 30 | 청계천 서울 빛초롱 축제 | ['#CheonggyecheonStream', '#Seoul', '#Korea', ... | Y |
| 3584 | In the Seoul Sky Views from the tallest bu... | 2020-01-01 | 56 | 서울스카이 Seoul Sky, Lotte World Tower, Korea | ['#TravelBlogger', '#Explore', '#Korea', '#Kor... | Y |
| 3585 | Hello 2020 #seoul #seoulkora #seoultra... | 2020-01-01 | 24 | 대한민국 | ['#seoul', '#seoulkora', '#seoultravel', '#kor... | Y |


In [8]:
# Flag rows whose content starts with a Hangul character as 'N'
korean = re.compile(r'[ㄱ-ㅣ가-힣]')
starts_with_korean = df['content'].str.match(korean.pattern, na=False)
df.loc[starts_with_korean, '대상여부'] = 'N'
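
The check is anchored at the start of the string (both `re.match` and pandas `str.match` behave this way), so a post is flagged `N` only when its very first character is Hangul. Korean text later in the post, or after a leading `#`, does not count. A small sketch with made-up strings illustrates this; the `flag` helper is hypothetical and not part of the notebook.

```python
import re

# Same pattern as the cell above: Hangul jamo or Hangul syllables.
korean = re.compile(r'[ㄱ-ㅣ가-힣]')

def flag(text: str) -> str:
    """Hypothetical helper: 'N' if text starts with Hangul, else 'Y'."""
    return 'N' if korean.match(text) else 'Y'

samples = ['한국 #seoullove', 'Hello 한국', '#사진산책 새해', '2020 서울']
flags = [flag(s) for s in samples]
print(flags)
```

To flag posts that contain Hangul anywhere, `korean.search` would be the corresponding call, but that is a different rule from the one the notebook applies.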

In [9]:
df['대상여부'].value_counts()

Y    3370
N     217
Name: 대상여부, dtype: int64

In [10]:
df.to_csv('./data/koreagram_2020_clean.csv', index=False, encoding='utf-8-sig')
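
The `utf-8-sig` encoding writes a byte-order mark (BOM) at the start of the file, which typically helps spreadsheet programs detect UTF-8 and display Hangul correctly. A minimal sketch, assuming a temporary file and a made-up frame rather than the real output path:

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the cleaned data.
toy = pd.DataFrame({'content': ['한국 #seoul'], '대상여부': ['N']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clean.csv')  # hypothetical file name
    toy.to_csv(path, index=False, encoding='utf-8-sig')

    # The file starts with the three-byte UTF-8 BOM.
    with open(path, 'rb') as f:
        has_bom = f.read(3) == b'\xef\xbb\xbf'

    # Reading with the same encoding strips the BOM again.
    restored = pd.read_csv(path, encoding='utf-8-sig')

print(has_bom)
print(restored)
```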
