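## **DACON News Article Label Recovery Hackathon**

This project aims to recover the `category` field of a news dataset.  
The dataset is a CSV file of 60,000 rows, each of which must be assigned to one of six categories. The file contains only the `id`, `title`, and `contents` fields.  
The categories to recover are as follows.  
0: Business  
1: Entertainment  
2: Politics  
3: Sports  
4: Tech  
5: World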
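
The id-to-name mapping above can be kept as a small lookup table so that predicted integers are easy to read back as category names. This is a minimal sketch; the names `ID_TO_CATEGORY` and `CATEGORY_TO_ID` are illustrative and do not appear in the notebook.

```python
# Category ids and names exactly as listed above.
ID_TO_CATEGORY = {
    0: "Business",
    1: "Entertainment",
    2: "Politics",
    3: "Sports",
    4: "Tech",
    5: "World",
}

# Reverse lookup, useful when labels are entered by name.
CATEGORY_TO_ID = {name: idx for idx, name in ID_TO_CATEGORY.items()}

predicted_ids = [3, 1, 4]  # example values only, not real model output
predicted_names = [ID_TO_CATEGORY[i] for i in predicted_ids]
print(predicted_names)
```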


-------------------
Dataset Info.  
news.csv  
id : unique sample id  
title : news article title  
contents : full text of the news article  

In [None]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from tensorflow.keras.models import save_model, load_model
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [None]:
data = pd.read_csv("news.csv")
data_with_labels = pd.read_csv("news_added_labels.csv")

In [None]:
data

Unnamed: 0,id,title,contents
0,NEWS_00000,Spanish coach facing action in race row,MADRID (AFP) - Spanish national team coach Lui...
1,NEWS_00001,Bruce Lee statue for divided city,"In Bosnia, where one man #39;s hero is often a..."
2,NEWS_00002,Only Lovers Left Alive's Tilda Swinton Talks A...,Yasmine Hamdan performs 'Hal' which she also s...
3,NEWS_00003,Macromedia contributes to eBay Stores,Macromedia has announced a special version of ...
4,NEWS_00004,Qualcomm plans to phone it in on cellular repairs,Over-the-air fixes for cell phones comes to Qu...
...,...,...,...
59995,NEWS_59995,"Dolphins Break Through, Rip Rams For First Win",But that #39;s OK. Because after a 31-14 rout ...
59996,NEWS_59996,"After Steep Drop, Price of Oil Rises",The freefall in oil prices ended Monday on a s...
59997,NEWS_59997,Pro football: Culpepper puts on a show,To say Daunte Culpepper was a little frustrate...
59998,NEWS_59998,Albertsons on the Rebound,The No. 2 grocer reports double-digit gains in...


In [None]:
# 600 articles labeled by hand
data_with_labels

Unnamed: 0,id,title,contents,category
0,NEWS_00000,Spanish coach facing action in race row,MADRID (AFP) - Spanish national team coach Lui...,3
1,NEWS_00001,Bruce Lee statue for divided city,"In Bosnia, where one man #39;s hero is often a...",1
2,NEWS_00002,Only Lovers Left Alive's Tilda Swinton Talks A...,Yasmine Hamdan performs 'Hal' which she also s...,1
3,NEWS_00003,Macromedia contributes to eBay Stores,Macromedia has announced a special version of ...,4
4,NEWS_00004,Qualcomm plans to phone it in on cellular repairs,Over-the-air fixes for cell phones comes to Qu...,4
...,...,...,...,...
595,NEWS_00595,Ward KOs 8-year drought,It #39;s been eight years since a US boxer sto...,3
596,NEWS_00596,Trump Signs Bill Making It Easier For Employer...,Republicans just repealed a major safety regul...,2
597,NEWS_00597,Gymnastics #39; pettiness is all-around,"ATHENS, Greece -- If international gymnastics ...",3
598,NEWS_00598,Group: Israel Violating International Law (AP),AP - Israel has violated international law by ...,5


In [None]:
!pip install nltk



In [None]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Build the lemmatizer and stop-word set once rather than on every call
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Text preprocessing function
def preprocessing(text):
    # Lowercase and replace punctuation with spaces
    text = text.lower()
    text = re.sub(r'[^\w\s]', ' ', text)

    # Lemmatize each word
    words = [lemmatizer.lemmatize(word) for word in text.split()]

    # Remove stop words
    return ' '.join(word for word in words if word not in stop_words)
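
The sample rows store apostrophes as the residue `#39;` (for example `man #39;s hero`). Once punctuation is stripped, the digits survive as a stray `39` token, which is visible in the `combined_text` output. One possible remedy, sketched here as an assumption rather than as part of the original pipeline, is to restore the apostrophe before the punctuation step. `restore_apostrophes` is a hypothetical helper.

```python
import re

def restore_apostrophes(text: str) -> str:
    # Hypothetical helper: turn the "#39;" residue (with an optional
    # leading space) back into a plain apostrophe.
    return re.sub(r"\s?#39;", "'", text)

def strip_and_split(text: str) -> list:
    # Same lowercase + punctuation-stripping step used in preprocessing().
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

sample = "In Bosnia, where one man #39;s hero is often a"

raw_tokens = strip_and_split(sample)
clean_tokens = strip_and_split(restore_apostrophes(sample))

print(raw_tokens)    # contains the stray token '39'
print(clean_tokens)  # '39' is gone
```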

In [None]:
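# Slice off the hand-labeled rows; the remainder is the inference data.
# .copy() makes the slice an independent DataFrame so later column assignments do not warn.
data = data.iloc[600:].copy()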

In [None]:
# Preprocess the inference data
data['title'] = data['title'].apply(preprocessing)
data['contents'] = data['contents'].apply(preprocessing)

# Preprocess the training data
data_with_labels['title'] = data_with_labels['title'].apply(preprocessing)
data_with_labels['contents'] = data_with_labels['contents'].apply(preprocessing)



In [None]:
# Combine title and contents for the inference data
data['combined_text'] = data['title'] + ' ' + data['contents']

# Combine title and contents for the training data
data_with_labels['combined_text'] = data_with_labels['title'] + ' ' + data_with_labels['contents']



In [None]:
data['combined_text']

600      u2019t let trump congress tax public infrastru...
601      e voting sceptic use web monitor election tech...
602      john kelly told lawmaker trump build wall alon...
603      hp 39 compaq x pc us new pentium 4 hewlett pac...
604      football site set miaa football committee yest...
                               ...                        
59995    dolphin break rip ram first win 39 ok 31 14 ro...
59996    steep drop price oil rise freefall oil price e...
59997    pro football culpepper put show say daunte cul...
59998    albertsons rebound 2 grocer report double digi...
59999    cassini craft spy saturn moon dione ap ap cass...
Name: combined_text, Length: 59400, dtype: object

In [None]:
data_with_labels['combined_text']

0      spanish coach facing action race row madrid af...
1      bruce lee statue divided city bosnia one man 3...
2      lover left alive tilda swinton talk almost qui...
3      macromedia contributes ebay store macromedia h...
4      qualcomm plan phone cellular repair air fix ce...
                             ...                        
595    ward ko 8 year drought 39 eight year since u b...
596    trump sign bill making easier employer hide wo...
597    gymnastics 39 pettiness around athens greece i...
598    group israel violating international law ap ap...
599    u2019s time democrat get trump train author sp...
Name: combined_text, Length: 600, dtype: object

In [None]:
# Tokenizer and padding
max_words = 10000
max_len = 200

tokenizer = Tokenizer(num_words=max_words)

# Fit on both corpora before encoding anything: each fit_on_texts call rebuilds
# the word index, so encoding between the two fits would use a different mapping
tokenizer.fit_on_texts(data['combined_text'])
tokenizer.fit_on_texts(data_with_labels['combined_text'])

# Tokenize the inference data
data_sequences = tokenizer.texts_to_sequences(data['combined_text'])
data_padded_sequences = pad_sequences(data_sequences, maxlen=max_len, padding='post')

# Tokenize the training data
data_with_labels_sequences = tokenizer.texts_to_sequences(data_with_labels['combined_text'])
data_with_labels_padded_sequences = pad_sequences(data_with_labels_sequences, maxlen=max_len, padding='post')
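
Keras's `Tokenizer` assigns word ids by frequency rank and rebuilds that ranking on every `fit_on_texts` call, so sequences encoded before and after an additional fit need not share a mapping. The dependency-free sketch below mimics this with a hypothetical `build_word_index` helper; it illustrates the idea and is not Keras's implementation.

```python
from collections import Counter

def build_word_index(counts):
    # Rank words by descending frequency; ids start at 1 because 0 is
    # reserved for padding. Ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: rank + 1 for rank, (word, _) in enumerate(ranked)}

# First "fit": counts from one toy corpus.
counts = Counter("oil price oil rise".split())
index_first = build_word_index(counts)

# Second "fit": more text changes the counts, so the ranking shifts.
counts.update("price price price match".split())
index_second = build_word_index(counts)

print(index_first["oil"], index_second["oil"])  # the id of "oil" changes
```

Encoding every dataset only after the final fit keeps a single, consistent mapping.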

In [None]:
labels = data_with_labels['category']

## **Modeling**

The text is processed with a 1D convolution layer followed by a GlobalMaxPooling1D layer, after which Dense layers perform the final classification.
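
To make the data flow concrete, the arithmetic below traces output shapes and parameter counts through the layers. It is a rough sketch that assumes the sizes used in this notebook (`max_words = 10000`, `max_len = 200`, 128-dimensional embeddings, 128 filters of width 5, a 64-unit Dense layer, 6 classes) and Keras defaults for `Conv1D` (stride 1, no padding, bias terms).

```python
max_words, max_len = 10000, 200
embed_dim, filters, kernel = 128, 128, 5
hidden, classes = 64, 6

# Shapes per sample (batch dimension omitted).
embedding_shape = (max_len, embed_dim)          # one vector per token
conv_shape = (max_len - kernel + 1, filters)    # "valid" convolution shortens the sequence
pooled_shape = (filters,)                       # max over all positions, per filter

# Trainable parameters per layer (weights + biases).
embedding_params = max_words * embed_dim
conv_params = kernel * embed_dim * filters + filters
dense_params = filters * hidden + hidden
output_params = hidden * classes + classes
total_params = embedding_params + conv_params + dense_params + output_params

print(embedding_shape, conv_shape, pooled_shape)
print(total_params)
```

Most of the parameters sit in the embedding table, which is worth keeping in mind given that only 600 labeled articles are available for training.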

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(data_with_labels_padded_sequences, labels, test_size=0.2, random_state=1028)

# Define the model
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))
model.add(Conv1D(128, 5, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(6, activation='softmax'))  # output layer with 6 classes

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=50, batch_size=128)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, y_test)
print(f"Test accuracy: {test_acc}")


Epoch 1/50
...
Epoch 50/50
Test accuracy: 0.6666666865348816
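
With `test_size=0.2` on 600 labeled rows, the held-out set has only 120 articles, so the reported accuracy carries noticeable sampling noise. The back-of-the-envelope calculation below uses a normal approximation to the binomial; it is an illustration and not part of the original notebook.

```python
import math

n_labeled = 600
test_size = 0.2
accuracy = 0.6666666865348816  # value printed by the evaluation cell

n_test = round(n_labeled * test_size)      # rows held out for evaluation
n_correct = round(accuracy * n_test)       # correctly classified test rows

# Standard error of a proportion and an approximate 95% interval.
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
low, high = accuracy - 1.96 * std_err, accuracy + 1.96 * std_err

print(n_test, n_correct)
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
```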


In [None]:
# Save the model
model.save('trained_model.h5')

# Load the model
loaded_model = load_model('trained_model.h5')



In [None]:
data_padded_sequences

array([[2239,  454,   13, ...,    0,    0,    0],
       [ 476, 1063,  273, ...,    0,    0,    0],
       [ 215, 2085,  319, ...,    0,    0,    0],
       ...,
       [ 993,  335,  246, ...,    0,    0,    0],
       [9845, 1726,   35, ...,    0,    0,    0],
       [3057, 4186, 2651, ...,    0,    0,    0]], dtype=int32)

In [None]:
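# Run inference
submission_df = pd.read_csv('sample_submission.csv')

all_predictions = model.predict(data_padded_sequences)

# Store the predictions in the 'category' column of 'sample_submission.csv'.
# A single .loc call avoids chained assignment; labels equal positions here
# because read_csv returns a default RangeIndex.
submission_df.loc[600:, 'category'] = np.argmax(all_predictions, axis=1)
submission_df.to_csv('result_submission.csv', index=False)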




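
`np.argmax(all_predictions, axis=1)` picks, for each article, the index of the largest softmax probability. The plain-Python sketch below reproduces that step on made-up probability rows and also keeps the winning probability, which could be used to flag uncertain predictions. The rows and the 0.5 threshold are invented for illustration.

```python
# Two made-up softmax outputs over the six categories.
probabilities = [
    [0.05, 0.05, 0.10, 0.70, 0.05, 0.05],
    [0.30, 0.10, 0.25, 0.05, 0.20, 0.10],
]

def argmax(row):
    # Index of the largest value; the first index wins on ties.
    return max(range(len(row)), key=row.__getitem__)

predicted = [argmax(row) for row in probabilities]
confidence = [max(row) for row in probabilities]

# Rows whose top probability falls below an arbitrary threshold.
uncertain_rows = [i for i, c in enumerate(confidence) if c < 0.5]

print(predicted, uncertain_rows)
```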


In [None]:
result_df = pd.read_csv('result_submission.csv')

In [None]:
result_df

Unnamed: 0,id,category
0,NEWS_00000,-1
1,NEWS_00001,-1
2,NEWS_00002,-1
3,NEWS_00003,-1
4,NEWS_00004,-1
...,...,...
59995,NEWS_59995,3
59996,NEWS_59996,0
59997,NEWS_59997,0
59998,NEWS_59998,2


In [None]:
data_with_labels['category']

0      3
1      1
2      1
3      4
4      4
      ..
595    3
596    2
597    3
598    5
599    2
Name: category, Length: 600, dtype: int64

In [None]:
# Fill the first 600 rows with the hand-assigned labels (.loc slices are end-inclusive)
result_df.loc[:599, 'category'] = data_with_labels['category'].to_numpy()
result_df.to_csv('final_submission.csv', index=False)

