# Instagram Data Clean
- Data cleaning (remove special characters, remove duplicates)
- Add the 대상여부 (target flag) column

### Inspecting content

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df = pd.read_csv('./data/koreagram_2020.csv', encoding='utf-8-sig')

In [3]:
df['content']

0       Bts spot The one who made this to sit in every...
1       Happy new year     It s the new year in Korea ...
2       #사진산책 #happynewyear 새해에는 행복 가득하길  새해 복 많이 받으세요...
3       happy  new year    2021                      #...
4       Happy or new to everyone I hope all your wishe...
                              ...                        
3582    New year new country is here in South Korea th...
3583    Lights at Cheonggyecheon Stream          #Cheo...
3584    In the Seoul Sky     Views from the tallest bu...
3585    Hello 2020         #seoul #seoulkora #seoultra...
3586    한국      #laterspamming#sorrynotsorry#seoullove...
Name: content, Length: 3587, dtype: object

### Data cleaning

In [4]:
# Use a regular expression to replace every special character except '#' with a space
df['content'] = df['content'].str.replace(pat=r'[^A-Za-z0-9가-힣#]', repl=' ', regex=True)
df['content']

0       Bts spot The one who made this to sit in every...
1       Happy new year     It s the new year in Korea ...
2       #사진산책 #happynewyear 새해에는 행복 가득하길  새해 복 많이 받으세요...
3       happy  new year    2021                      #...
4       Happy or new to everyone I hope all your wishe...
                              ...                        
3582    New year new country is here in South Korea th...
3583    Lights at Cheonggyecheon Stream          #Cheo...
3584    In the Seoul Sky     Views from the tallest bu...
3585    Hello 2020         #seoul #seoulkora #seoultra...
3586    한국      #laterspamming#sorrynotsorry#seoullove...
Name: content, Length: 3587, dtype: object
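
To see what the pattern in In [4] does to a single caption, the sketch below applies the same character class with `re.sub` to a made-up string (the sample text is an assumption, not a row from the dataset). Every character outside ASCII letters, digits, Hangul syllables and `#` is replaced by one space, existing spaces included, which is consistent with the wide gaps visible in the captions above.

```python
import re

# Same character class as in In [4]: keep ASCII letters, digits,
# Hangul syllables (가-힣) and '#'; every other character becomes a space.
pattern = re.compile(r'[^A-Za-z0-9가-힣#]')

# Hypothetical caption, not taken from the dataset.
sample = "It's 2021! #seoul 새해~"
cleaned = pattern.sub(' ', sample)
print(repr(cleaned))

# Standalone jamo such as ㅋ lie outside the 가-힣 range, so they are replaced too.
jamo_cleaned = pattern.sub(' ', 'ㅋㅋ')
print(repr(jamo_cleaned))
```

Because the replacement is a space rather than an empty string, adjacent words never get glued together, at the cost of leaving runs of spaces behind.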

In [5]:
# Remove duplicate captions (keeping the first occurrence) and renumber the index
df = df.drop_duplicates(subset=['content']).reset_index(drop=True)
df

Unnamed: 0,content,place_final,place,place_content,대상여부,date,like,tags,카카오위치명,경도,위도
0,Bts spot The one who made this to sit in every...,,The Min's - 더 민스,,0,2020-12-31,60,[],,,
1,Happy new year It s the new year in Korea ...,,,,0,2020-12-31,840,[],,,
2,#사진산책 #happynewyear 새해에는 행복 가득하길 새해 복 많이 받으세요...,,선유도공원,,0,2020-12-31,124,"['#사진산책', '#happynewyear.새해에는', '#사진에감성을더하다', ...",,,
3,happy new year 2021 #...,,,,0,2020-12-31,80,"['#ソインク?ク', '#????????', '#徐仁?', '#seoinguk', ...",,,
4,Happy or new to everyone I hope all your wishe...,경기 성남시 분당구,분당구,,1,2020-12-31,41,"['#corea', '#coreano', '#coreadelsur', '#a?onu...",분당정자동카페골목,127.106139,37.370151
...,...,...,...,...,...,...,...,...,...,...,...
3487,New year new country is here in South Korea th...,경복궁,서울,경복궁,2,2020-01-01,492,"['#seoul', '#coreedusud', '#southkorea', '#gye...",경복궁,126.976897,37.577609
3488,Lights at Cheonggyecheon Stream #Cheo...,청계천 서울 빛초롱 축제,청계천 서울 빛초롱 축제,,2,2020-01-01,30,"['#CheonggyecheonStream', '#Seoul', '#Korea', ...",서울빛초롱축제,126.977782,37.569190
3489,In the Seoul Sky Views from the tallest bu...,서울스카이,"서울스카이 Seoul Sky, Lotte World Tower, Korea",,2,2020-01-01,56,"['#TravelBlogger', '#Explore', '#Korea', '#Kor...",서울스카이,127.102544,37.512673
3490,Hello 2020 #seoul #seoulkora #seoultra...,서울,대한민국,서울,1,2020-01-01,24,"['#seoul', '#seoulkora', '#seoultravel', '#kor...",북한산둘레길 1구간소나무숲길,127.009021,37.658889
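
The deduplication step can be checked on a toy frame. This is a minimal sketch with made-up rows (the `toy` frame and its values are assumptions); it shows that `drop_duplicates` keeps the first row for each repeated caption by default and that `reset_index(drop=True)` renumbers the remaining rows from 0.

```python
import pandas as pd

# Hypothetical rows; the first and third captions are identical.
toy = pd.DataFrame({
    'content': ['Hello 2020 #seoul', 'In the Seoul Sky', 'Hello 2020 #seoul'],
    'like': [24, 56, 99],
})

# Compare rows on 'content' only; the default keep='first' retains the earliest row.
deduped = toy.drop_duplicates(subset=['content']).reset_index(drop=True)
print(deduped)
```

Since only `content` is compared, the surviving row carries the `like` value of the first occurrence, and the other columns of the dropped row are discarded.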


In [6]:
# Strip leading whitespace
df['content'] = df['content'].str.lstrip()
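
Stripping leading whitespace matters for the 대상여부 step further down, because `re.match` only tests the beginning of a string. The sketch below uses two made-up captions (assumptions, not dataset rows) to compare the check before and after `str.lstrip()`.

```python
import re
import pandas as pd

# Same pattern as the one compiled in In [9].
korean = re.compile(r'[ㄱ-ㅣ가-힣]')

# Hypothetical captions; the first has leading spaces left over from cleaning.
s = pd.Series(['  한국 #seoul', 'Hello 2020'])

# re.match anchors at position 0, so a leading space hides the Hangul start.
before = [bool(korean.match(x)) for x in s]
after = [bool(korean.match(x)) for x in s.str.lstrip()]
print(before, after)
```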

### Adding the 대상여부 column

In [7]:
# Reset 대상여부 to a default of 'Y' (this overwrites the existing 0/1/2 values)
df['대상여부'] = 'Y'
df

Unnamed: 0,content,place_final,place,place_content,대상여부,date,like,tags,카카오위치명,경도,위도
0,Bts spot The one who made this to sit in every...,,The Min's - 더 민스,,Y,2020-12-31,60,[],,,
1,Happy new year It s the new year in Korea ...,,,,Y,2020-12-31,840,[],,,
2,#사진산책 #happynewyear 새해에는 행복 가득하길 새해 복 많이 받으세요...,,선유도공원,,Y,2020-12-31,124,"['#사진산책', '#happynewyear.새해에는', '#사진에감성을더하다', ...",,,
3,happy new year 2021 #...,,,,Y,2020-12-31,80,"['#ソインク?ク', '#????????', '#徐仁?', '#seoinguk', ...",,,
4,Happy or new to everyone I hope all your wishe...,경기 성남시 분당구,분당구,,Y,2020-12-31,41,"['#corea', '#coreano', '#coreadelsur', '#a?onu...",분당정자동카페골목,127.106139,37.370151
...,...,...,...,...,...,...,...,...,...,...,...
3487,New year new country is here in South Korea th...,경복궁,서울,경복궁,Y,2020-01-01,492,"['#seoul', '#coreedusud', '#southkorea', '#gye...",경복궁,126.976897,37.577609
3488,Lights at Cheonggyecheon Stream #Cheo...,청계천 서울 빛초롱 축제,청계천 서울 빛초롱 축제,,Y,2020-01-01,30,"['#CheonggyecheonStream', '#Seoul', '#Korea', ...",서울빛초롱축제,126.977782,37.569190
3489,In the Seoul Sky Views from the tallest bu...,서울스카이,"서울스카이 Seoul Sky, Lotte World Tower, Korea",,Y,2020-01-01,56,"['#TravelBlogger', '#Explore', '#Korea', '#Kor...",서울스카이,127.102544,37.512673
3490,Hello 2020 #seoul #seoulkora #seoultra...,서울,대한민국,서울,Y,2020-01-01,24,"['#seoul', '#seoulkora', '#seoultravel', '#kor...",북한산둘레길 1구간소나무숲길,127.009021,37.658889


In [9]:
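# Set 대상여부 to 'N' when the caption starts with a Hangul character
# (re.match only tests the beginning of the string)
korean = re.compile(r'[ㄱ-ㅣ가-힣]')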
for i in range(len(df['content'])):
    if korean.match(str(df.loc[i, 'content'])):
        df.loc[i, '대상여부'] = 'N'
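
The same flagging can be written without an explicit loop. The following is a sketch of a vectorized alternative on a toy frame, not the notebook's own method; the `toy` frame and its captions are assumptions. `Series.str.match` anchors at the start of each string in the same way as `re.match`, and `na=False` treats missing captions as non-matching.

```python
import pandas as pd

# Toy frame standing in for df; the captions are made up.
toy = pd.DataFrame({'content': ['한국 #seoul', 'Hello 2020', '#사진산책 새해', None]})
toy['대상여부'] = 'Y'

# Boolean mask: True where the caption begins with a Hangul jamo or syllable.
starts_with_hangul = toy['content'].str.match(r'[ㄱ-ㅣ가-힣]', na=False)
toy.loc[starts_with_hangul, '대상여부'] = 'N'
flags = toy['대상여부'].tolist()
print(flags)
```

A caption that opens with `#` stays `'Y'` even when the hashtag itself is in Korean, because only the first character is tested.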

In [10]:
df['대상여부'].value_counts()

Y    3275
N     217
Name: 대상여부, dtype: int64

In [10]:
df.to_csv('./data/koreagram_2020_clean.csv', index=False, encoding='utf-8-sig')
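
A quick way to confirm that the export round-trips is to write a small frame with the same options and read it back. This sketch uses a temporary directory and a hypothetical file name (`toy_clean.csv`) rather than the real output path; the rows are made up.

```python
import os
import tempfile
import pandas as pd

# Hypothetical rows with a Korean column name, as in the cleaned dataset.
toy = pd.DataFrame({'content': ['한국 #seoul', 'Hello 2020'], '대상여부': ['N', 'Y']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_clean.csv')  # hypothetical file name
    # index=False keeps the row index out of the file;
    # 'utf-8-sig' writes a byte order mark at the start of the file.
    toy.to_csv(path, index=False, encoding='utf-8-sig')
    restored = pd.read_csv(path, encoding='utf-8-sig')

print(restored.columns.tolist())
```

Reading with `encoding='utf-8-sig'` removes the byte order mark again, so the first column name comes back unchanged.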
