# **Text Mining Project**

**Name: Sharon Zefanya Setiawan**

**Student ID (NIM): 2501961022**

**Class: LA09**

## 1. Scraping

*Scraping*, or *web scraping*, is the automated extraction of information from web pages by a program or bot. Here, the program visits a site and collects data from its parts, such as text, images, or tables, which is then processed or stored. *Scraping* makes it possible to gather information from the web efficiently, but some sites have policies that restrict or prohibit it in order to protect privacy or copyright.
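
One common place where a site states its crawling policy is its `robots.txt` file. The sketch below is not part of the original pipeline; it only illustrates how such rules could be checked before fetching a page, using Python's standard `urllib.robotparser`. The `ROBOTS_TXT` content, the `example.com` URLs, and the helper name `is_allowed` are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A public news path versus a path the rules disallow
print(is_allowed(ROBOTS_TXT, "https://example.com/news/article-1"))
print(is_allowed(ROBOTS_TXT, "https://example.com/private/draft"))
```

For a real site, `RobotFileParser.set_url()` followed by `read()` would download the live file instead of parsing a string. A `robots.txt` check covers crawler rules only; a site's terms of service may impose further limits.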

In [None]:
!pip install newspaper3k
!pip install xmltodict



In [None]:
# libraries
import pandas as pd
import tldextract
import requests
import xmltodict
import random

from newspaper import Article, ArticleException
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### ***XML Sitemap***

I use XML sitemaps to simplify collecting data from news websites. An XML sitemap is a specially structured text file that lists the URLs on a site and so describes how its pages are organized. With a sitemap, a scraper can identify and fetch the URLs of relevant news pages directly, without crawling the whole site. This makes data collection faster and more efficient while reducing the load on the server being accessed.
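
To make the structure concrete, here is a minimal sketch of what a sitemap looks like and how its `<loc>` entries can be read. It uses the standard library's `xml.etree.ElementTree` rather than `xmltodict`, purely so that it runs without extra installs. The two-entry `SITEMAP_XML` string, the `example.com` URLs, and the helper name `extract_locs` are assumptions for illustration, not data from the actual sites.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-entry sitemap; real news sitemaps list many more <url> entries
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/article-1</loc></url>
  <url><loc>https://example.com/news/article-2</loc></url>
</urlset>
"""

# Sitemap elements live in this XML namespace, so lookups must qualify their tags
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_locs(sitemap_xml):
    """Return the page URLs listed in a sitemap's <url><loc> entries."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)]

print(extract_locs(SITEMAP_XML))
```

With `xmltodict`, the same document becomes nested dictionaries, and the list of entries is reached through the `urlset` and `url` keys.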

In [None]:
# list of sitemaps
sitemap_urls = [
    'https://www.cnnindonesia.com/hiburan/musik/sitemap_news.xml',
    'https://wolipop.detik.com/entertainment/sitemap_news.xml',
    'https://www.liputan6.com/lifestyle/sitemap_news.xml',
    'https://celebrity.okezone.com/hotgossip/sitemap.xml',
    'https://www.alinea.id/politik/sitemap.xml',
    'https://www.cnnindonesia.com/internasional/asean/sitemap_news.xml',
    'https://news.detik.com/pemilu/sitemap_news.xml',
    'https://www.liputan6.com/pemilu/sitemap_news.xml',
    'https://www.cnnindonesia.com/olahraga/moto-gp/sitemap_news.xml',
    'https://sport.detik.com/basket/sitemap_news.xml',
    'https://www.liputan6.com/bola/sitemap_news.xml',
    'https://sports.okezone.com/netting/sitemap.xml'
]

In [None]:
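# request the sitemap, convert it to a dictionary, and sample article URLs
def get_selected_urls_from_sitemap(sitemap_url, num_urls=50):
    try:
        response = requests.get(sitemap_url, timeout=30)
        response.raise_for_status()  # treat HTTP errors (403, 404, ...) as failures
        sitemap_data = xmltodict.parse(response.text)
        urls = sitemap_data['urlset']['url']
        # xmltodict returns a dict, not a list, when the sitemap has a single <url>
        if isinstance(urls, dict):
            urls = [urls]
        # randomly pick up to num_urls entries without mutating the parsed data
        selected = random.sample(urls, min(num_urls, len(urls)))
        return [url['loc'] for url in selected]
    except Exception as e:
        print(f"Failed to retrieve sitemap from URL: {sitemap_url}")
        print(f"Error: {str(e)}")
        return []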

In [None]:
# create a dictionary to store the selected URLs for each sitemap
selected_urls_dict = {}

# iterate through the sitemap URLs and retrieve selected URLs
for sitemap_url in sitemap_urls:
    selected_urls = get_selected_urls_from_sitemap(sitemap_url)
    selected_urls_dict[sitemap_url] = selected_urls

# Print the selected URLs for each sitemap
for sitemap_url, selected_urls in selected_urls_dict.items():
    print(f"Sitemap: {sitemap_url}")
    print("\n".join(selected_urls))
    print("\n" + "="*50 + "\n")

Sitemap: https://www.cnnindonesia.com/hiburan/musik/sitemap_news.xml
https://www.cnnindonesia.com/hiburan/20231020155551-227-1013888/pink-batalkan-konser-usai-kena-infeksi-pernapasan
https://www.cnnindonesia.com/hiburan/20231016061921-227-1011662/coldplay-bakal-jual-tiket-tambahan-di-jakarta-rp315-ribu-hari-ini
https://www.cnnindonesia.com/hiburan/20231103141219-227-1019628/konser-baru-pekan-depan-fan-taylor-swift-gelar-kamping-sejak-juni
https://www.cnnindonesia.com/hiburan/20231106151731-227-1020575/konser-morrissey-9-november-di-singapura-dibatalkan
https://www.cnnindonesia.com/hiburan/20231016113547-227-1011757/netizen-usai-menang-tiket-tambahan-coldplay-still-speechless
https://www.cnnindonesia.com/hiburan/20231013074604-227-1010667/ady-larang-naff-nyanyikan-lagu-lagu-ciptaannya
https://www.cnnindonesia.com/hiburan/20231102122247-230-1019058/infografis-tur-taylor-swift-siap-jadi-tercuan-sedunia
https://www.cnnindonesia.com/hiburan/20231019084418-227-1013177/jungkook-bts-dipastikan

In [None]:
# identify media
def identify_media(source_url):
    if 'alinea' in source_url:
        return 'Alinea'
    elif 'cnnindonesia' in source_url:
        return 'CNN Indonesia'
    elif 'detik' in source_url:
        return 'Detik'
    elif 'liputan6' in source_url:
        return 'Liputan 6'
    elif 'okezone' in source_url:
        return 'Okezone'
    else:
        return 'Unknown'
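
Because the function above matches substrings anywhere in the URL, a path such as `/detik-news` on an unrelated site would also be labeled `Detik`. As a possible refinement (not what this notebook uses), the sketch below compares only the hostname, using the standard `urllib.parse`. The names `MEDIA_BY_DOMAIN` and `identify_media_by_host` are hypothetical.

```python
from urllib.parse import urlparse

# Registered domain of each outlet mapped to its display name
MEDIA_BY_DOMAIN = {
    'alinea.id': 'Alinea',
    'cnnindonesia.com': 'CNN Indonesia',
    'detik.com': 'Detik',
    'liputan6.com': 'Liputan 6',
    'okezone.com': 'Okezone',
}

def identify_media_by_host(source_url):
    # hostname is None when the URL has no scheme, so fall back to an empty string
    host = (urlparse(source_url).hostname or '').lower()
    for domain, media in MEDIA_BY_DOMAIN.items():
        # accept the bare domain and any subdomain (www., sport., wolipop., ...)
        if host == domain or host.endswith('.' + domain):
            return media
    return 'Unknown'

print(identify_media_by_host('https://wolipop.detik.com'))
print(identify_media_by_host('https://example.com/detik-news'))
```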

In [None]:
# label article
def label_article(url):
    urls = url.lower()

    if 'hiburan' in urls or 'entertainment' in urls or 'lifestyle' in urls or 'celebrity' in urls:
        return 'Entertainment'
    elif 'internasional' in urls or 'pemilu' in urls or 'politik' in urls:
        return 'Politics'
    elif 'sport' in urls or 'bola' in urls or 'olahraga' in urls:
        return 'Sports'
    else:
        return 'Others'
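
The same keyword rules can also be written as a lookup table, which makes them easier to spot-check. The sketch below is an illustrative equivalent, not a replacement for the function above. `LABEL_KEYWORDS` and `label_from_keywords` are hypothetical names, and the last URL in the check is made up to show a limitation.

```python
# Keyword table mirroring the rules above; the first matching label wins
LABEL_KEYWORDS = {
    'Entertainment': ('hiburan', 'entertainment', 'lifestyle', 'celebrity'),
    'Politics': ('internasional', 'pemilu', 'politik'),
    'Sports': ('sport', 'bola', 'olahraga'),
}

def label_from_keywords(url):
    lowered = url.lower()
    for label, keywords in LABEL_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return 'Others'

for url in [
    'https://www.cnnindonesia.com/hiburan/musik/sitemap_news.xml',
    'https://www.alinea.id/politik/sitemap.xml',
    'https://sport.detik.com/basket/sitemap_news.xml',
    'https://news.detik.com/pemilu/d-1/artis-lifestyle',  # made-up URL
]:
    print(label_from_keywords(url), url)
```

The made-up last URL sits in the election section but is labeled `Entertainment`, because the keywords are matched against the whole URL, including the article slug. Labels produced this way are therefore worth spot-checking before they are used for training.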

In [None]:
scraped_data = {'text': [], 'media': [], 'label': []}

In [None]:
# iterate through the selected URLs and scrape data
for sitemap_url, selected_urls in selected_urls_dict.items():
    for url in selected_urls:
        try:
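            # fetch the page once with a timeout, then let newspaper parse that HTML
            response = requests.get(url, timeout=120)
            response.raise_for_status()  # skip error pages instead of parsing them
            article = Article(url)

            article.download(input_html=response.text)
            article.parse()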

            if article.text:
                # Scrape article text
                scraped_data['text'].append(article.text)

                # Scrape media source
                source_url = article.source_url
                source = identify_media(source_url)
                scraped_data['media'].append(source)

                # Label article based on the URL
                label = label_article(url)
                scraped_data['label'].append(label)
            else:
                print(f"Failed to scrape URL: {url} (Empty article text)")
        except (requests.exceptions.RequestException, ArticleException) as e:
            print(f"Failed to scrape URL: {url}")
            print(f"Error: {str(e)}")
        except Exception as e:
            print(f"An unexpected error occurred while scraping URL: {url}")
            print(f"Error: {str(e)}")

In [None]:
df = pd.DataFrame(scraped_data)

df

| | text | media | label |
|---|---|---|---|
| 0 | --\n\nPink mengungkapkan dirinya mengidap infe... | CNN Indonesia | Entertainment |
| 1 | --\n\nColdplay bakal menjual Infinity Tickets ... | CNN Indonesia | Entertainment |
| 2 | --\n\nTaylor Swift akan resmi memulai leg inte... | CNN Indonesia | Entertainment |
| 3 | --\n\nPenyanyi asal Inggris Morrissey resmi me... | CNN Indonesia | Entertainment |
| 4 | --\n\nSejumlah penggemar Coldplay berhasil mem... | CNN Indonesia | Entertainment |
| ... | ... | ... | ... |
| 590 | RENNES - Jonatan Christie speechless alias sul... | Okezone | Sports |
| 591 | MOMEN tak jujur pebulu tangkis Denmark, Kim As... | Okezone | Sports |
| 592 | JAKARTA - Yeremia Erich Yoche Yacob Rambitan m... | Okezone | Sports |
| 593 | HASIL French Open 2023 akan dibahas di sini. G... | Okezone | Sports |


In [None]:
# save to google drive path
destination_path = '/content/drive/My Drive/UTS Text Mining/scraping-dataset.csv'

In [None]:
# save as csv file
df.to_csv(destination_path, index=False)