## **Tapas - WebScraping Project**


<img src="https://scontent-gig2-1.xx.fbcdn.net/v/t1.6435-9/67064783_1590702974393670_4265688551187808256_n.jpg?_nc_cat=111&ccb=1-5&_nc_sid=e3f864&_nc_eui2=AeE_Z804yhpjzF7_JAf_PnTHWNIs4NRB_E5Y0izg1EH8TsrAj0E_gbN2MOy5PvW0ZgkHk7rXzlw_12lQjclXLBEf&_nc_ohc=65rkqjb3FksAX-ELv1a&_nc_ht=scontent-gig2-1.xx&oh=00_AT8lz-1dWIgLQ38E6C1uNO00oTxw-H6hXQQ0BeN7Ep7suw&oe=6297B2C6" width="100%">

### **Introduction**

Established in 2012, headquartered in Los Angeles with key global operations in Seoul, South Korea, and Beijing, China, Tapas Media is one of the fastest-growing digital publishing platforms of webcomics and novels in North America. Tapas has created a community of more than 9M registered users with stories from 68,000 creators and published over 99,000 stories to date.

**Disclaimer:** This is a personal project to practice web scraping and exploratory data analysis. I do not recommend using it for other purposes. Use it at your own risk.

### **Libraries**

Tapas returns the items of its listing pages as JSON, which is easy to parse and manipulate. We will use only a few tools for the project:

* requests for the HTTP requests.
* pandas for data handling.
* bs4 (BeautifulSoup) for HTML extraction.
* tqdm for progress bars.

To replicate the project, you may need to install these packages with pip.

In [1]:
import time
import requests
import pandas as pd

from tqdm.notebook import tqdm, tnrange
from bs4 import BeautifulSoup

### **Variables**

Let's define our URL for scraping.

In [2]:
url = 'https://tapas.io/comics'

We have some important parameters to define in our requests.

* **b**: ALL (List all items).
* **g**: 0 (Items genre set to All Genres).
* **f**: PRM (Only premium items).
* **s**: LIKE (Ordered by likes).
* **pageNumber**: The dynamic page value.
* **pageSize**: 20 (Number of items per page).

Let's focus only on the premium webtoons, because the free catalog has more than 80k items. The scraper will only change the **pageNumber** parameter.

In [4]:
params = {
    'b': 'ALL',
    'g': 0,
    'f': 'PRM',
    's': 'LIKE',
    'pageNumber': 1,
    'pageSize': 20
}
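When these parameters are passed to `requests.get`, they are encoded into the query string of the URL. The sketch below uses the standard library's `urlencode` to show roughly what that URL looks like for a given page. The helper name `build_page_url` is hypothetical and only for illustration; it also copies the dictionary instead of mutating the shared `params`.

```python
from urllib.parse import urlencode

url = 'https://tapas.io/comics'
params = {
    'b': 'ALL',
    'g': 0,
    'f': 'PRM',
    's': 'LIKE',
    'pageNumber': 1,
    'pageSize': 20
}


def build_page_url(page_number):
    """Return the listing URL for one page (hypothetical helper)."""
    # Copy the shared params so the original dictionary stays untouched.
    page_params = {**params, 'pageNumber': page_number}
    return f'{url}?{urlencode(page_params)}'


print(build_page_url(2))
# https://tapas.io/comics?b=ALL&g=0&f=PRM&s=LIKE&pageNumber=2&pageSize=20
```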

To keep the server from rejecting our requests, we need to send a user agent. A second set of headers also asks for JSON responses only, which are easier to manipulate.

In [5]:
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36',
}

json_headers = headers.copy()
json_headers['accept'] = 'application/json'

## **Scraper**

I will use a for loop to scrape the pages, but first I need to know how many pages are available. Usually we would extract the page count from the HTML, but in this case we can read it directly from the JSON response. Let's make the request to check it.

In [6]:
req = requests.get(url, headers=json_headers, params=params)
req.status_code

200

We can read the total number of pages from the **pager** key in the JSON.

In [7]:
total_page = req.json()['data']['pager']['total_page']
total_page

37
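If the response ever changes shape, the chained key lookup fails with a bare `KeyError`. A minimal sketch of a more defensive lookup follows. The `sample_payload` dictionary is a hypothetical stand-in that mirrors only the keys this notebook reads (`data.pager.total_page` and `data.body`), and `get_total_pages` is an illustrative helper name.

```python
# Hypothetical payload: only the keys used in this notebook are included.
sample_payload = {
    'data': {
        'pager': {'total_page': 37},
        'body': '<li>...</li>',
    }
}


def get_total_pages(payload):
    """Return the page count, or raise a readable error if the shape changed."""
    try:
        return int(payload['data']['pager']['total_page'])
    except (KeyError, TypeError, ValueError) as exc:
        raise ValueError('Unexpected response: data.pager.total_page not found') from exc


print(get_total_pages(sample_payload))
```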

There is a total of 37 pages with 20 items each, which gives us a maximum of 740 items. If an error occurs midway and nothing has been stored, we lose all the progress. One option is to write each result directly to a file, but since we are working in a notebook, I will keep the data in a dictionary instead: the key is the page number and the value is the list of webtoon metadata for that page. The dictionary survives between cell runs, so a failed run can resume from the pages that are still missing.

In [17]:
pages_data = {}
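The dictionary only lives as long as the notebook kernel. As a minimal sketch, assuming a JSON checkpoint file is acceptable, the two helpers below save and restore it. The function names and the file path are hypothetical and not part of the original project.

```python
import json
from pathlib import Path


def save_checkpoint(pages_data, path):
    """Write the scraped pages to disk as JSON."""
    Path(path).write_text(json.dumps(pages_data, ensure_ascii=False), encoding='utf-8')


def load_checkpoint(path):
    """Load previously saved pages, or start empty if there is no file yet."""
    file = Path(path)
    if not file.exists():
        return {}
    raw = json.loads(file.read_text(encoding='utf-8'))
    # JSON object keys are always strings, so restore the integer page numbers.
    return {int(page): items for page, items in raw.items()}
```

One possible use is to call `save_checkpoint(pages_data, 'pages_data.json')` after each page is stored and `load_checkpoint('pages_data.json')` when the notebook is reopened.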

Most of the information about a webtoon is not available on the search page, but its info page has much more. We will make a new request for each item to get this information. Let's encapsulate the scraping process in a function.

In [16]:
def get_item_information(info):
    """Scrape the info page of one series and return its metadata as a list."""
    # A timeout keeps a stalled connection from hanging the whole run.
    req = requests.get(info, headers=headers, timeout=30)
    if req.status_code != 200:
        raise requests.ConnectionError(f'Connection failed. [{req.status_code}]')

    soup = BeautifulSoup(req.text, 'html.parser')

    creators = [c.text for c in soup.find('ul', attrs={'class': 'creator-section'}).find_all('a', attrs={'class': 'name'})]
    genres = [g.text for g in soup.find('div', attrs={'class': 'info-detail__row'}).find_all('a')]
    # The stats links hold views, subscribers and likes, in that order.
    stats = soup.find('div', attrs={'class': 'stats'}).find_all('a')
    views = stats[0]['data-title']
    subscribers = stats[1]['data-title']
    likes = stats[2]['data-title']
    # Fragile: assumes the second style declaration is `background-image: url(...)`.
    banner = soup.find('div', attrs={'class': 'js-top-banner'})['style'].split(';')[1][22:-1]
    details = soup.find('span', attrs={'class': 'description__body'}).text.strip()
    tags = [t.text for t in soup.find_all('a', attrs={'class': 'tags__item'})]
    # The episode counter text starts with the number of episodes.
    episodes = soup.find('p', attrs={'class': 'episode-cnt'}).text.strip().split(' ')[0]
    # Date shown on the first item of the episode list.
    released = soup.find('li', attrs={'class': 'episode-list__item'}).find('p', attrs={'class': 'additional'}).span.text

    return [
        creators, genres, views, subscribers, likes, banner, details, tags, episodes, released
    ]
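The banner is taken from an inline `style` attribute by position, which breaks when the declarations come in a different order or length. A minimal sketch of a more tolerant alternative, assuming the address always appears inside a CSS `url(...)` expression, is shown below. The name `extract_banner_url` is hypothetical.

```python
import re

# Matches url(...) with optional single or double quotes around the address.
_URL_IN_STYLE = re.compile(r"url\(\s*['\"]?(.*?)['\"]?\s*\)")


def extract_banner_url(style):
    """Return the first url(...) target in an inline style string, or None."""
    match = _URL_IN_STYLE.search(style or '')
    return match.group(1) if match else None


print(extract_banner_url('height: 100px; background-image: url(https://example.com/banner.jpg)'))
# https://example.com/banner.jpg
```

Returning `None` when no address is found makes missing banners easy to spot later, instead of storing a meaningless slice of the style string.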


And now for the main part, let's scrape!

In [20]:
for page in tqdm(range(1, total_page + 1), desc='Pages'):
    params['pageNumber'] = page

    # Skip pages already scraped, so the cell can be rerun after an error.
    if pages_data.get(page):
        continue

    req = requests.get(url, headers=json_headers, params=params, timeout=30)
    if req.status_code != 200:
        raise requests.ConnectionError(f'Connection failed. [{req.status_code}]')

    # The JSON response carries the listing as an HTML fragment.
    html = req.json()['data']['body']
    soup = BeautifulSoup(html, 'html.parser')

    data = []
    items = soup.find_all('li')
    for item in tqdm(items, desc=f'Items of Page {page}'):
        title = item.find('img')['alt']
        item_id = item.div.a['data-series-id']
        # genre = item.p.a.text
        # stats = item.find('span', attrs={'class': 'item__stat'}).text
        link = 'https://tapas.io' + item.div.a['href'] + '/info'
        cover = item.find('img')['src']

        item_data = [title, item_id, link, cover]
        item_data.extend(get_item_information(link))

        data.append(item_data)

        # Pause between requests to avoid overloading the server.
        time.sleep(1)

    # Store the page only after all of its items were scraped.
    pages_data[page] = data

The progress bars report 37 pages: pages 1 to 36 contain 20 items each and page 37 contains 17, for a total of 737 items.
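The loop raises as soon as a single request fails, and rerunning the cell resumes from the first missing page. For transient network errors, a retry with exponential backoff may avoid the manual rerun. The helper below is a minimal sketch under that assumption. The name `with_retries` and its parameters are hypothetical, and the `sleep` argument exists only so the waiting can be replaced in tests.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**n seconds and retry."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Inside the loop, the call could then become `with_retries(lambda: get_item_information(link))`.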

In [26]:
pages_data[1][0]

['My Gentle Giant',
 '140273',
 'https://tapas.io/series/My-Gentle-Giant/info',
 'https://d30womf5coomej.cloudfront.net/sa/55/a77bd21b-e4f6-4934-b316-0b9af3cb7bcc_z.jpg',
 ['EmAuthor'],
 ['BL', 'LGBTQ+', 'Slice of life'],
 '29,109,978 views',
 '241,427 subscribers',
 '2,934,170 likes',
 'https://d30womf5coomej.cloudfront.net/sa/01/c44877f9-be85-4558-bb97-cd3f543a5938.jpg',
 "Jun Watanabe is your average outcast. He's a third year in high school, but he looks like a first year. He wants to play soccer, but he sucks at sports. And to top it all off....he's gay. However, when Akihiro, the school giant, comes to his rescue, Jun finds out that looks can be deceiving, first loves are often full of misunderstandings, and what matters most, is not how you look, but what's inside your heart.\n\nSupport Us on Patreon: https://www.patreon.com/AUTHOREAB",
 ['#gay',
  '#soft',
  '#comedy',
  '#Angst',
  '#bl',
  '#fluff',
  '#first_love',
  '#slice_of_life',
  '#Pure_Babies',
  '#gay',
  '#soft',
 

Above we can see the information about **My Gentle Giant**. The values are unlabeled, so let's define a list of column names.

In [22]:
columns = [
    'title', 'item_id', 'link', 'cover', 'creators', 'genres', 'views', 'subscribers', 'likes', 'banner', 'details', 'tags', 'episodes', 'released'
]

Each key of the dictionary holds the data of one page. Let's gather everything into a single list and pass it to a DataFrame.

In [27]:
full_data = []
for i in pages_data.values():
    full_data.extend(i)
    
df = pd.DataFrame(full_data, columns=columns)
df.head()

| | title | item_id | link | cover | creators | genres | views | subscribers | likes | banner | details | tags | episodes | released |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | My Gentle Giant | 140273 | https://tapas.io/series/My-Gentle-Giant/info | https://d30womf5coomej.cloudfront.net/sa/55/a7... | [EmAuthor] | [BL, LGBTQ+, Slice of life] | 29,109,978 views | 241,427 subscribers | 2,934,170 likes | https://d30womf5coomej.cloudfront.net/sa/01/c4... | Jun Watanabe is your average outcast. He's a t... | [#gay, #soft, #comedy, #Angst, #bl, #fluff, #f... | 176 | Sep 28, 2020 |
| 1 | DaiMaou | 36492 | https://tapas.io/series/daimaou/info | https://d30womf5coomej.cloudfront.net/sa/ee/47... | [Amanduur] | [BL, Comedy, Fantasy] | 24,186,270 views | 103,218 subscribers | 2,453,503 likes | https://d30womf5coomej.cloudfront.net/sa/e9/5b... | TL;DR: Shitty comedy masquerading as an actual... | [#gay, #Fantasy, #romance, #comedy, #demons, #... | 479 | Jul 08, 2020 |
| 2 | Idiots Don't Catch Colds | 67447 | https://tapas.io/series/Idiots-Dont-Catch-Cold... | https://d30womf5coomej.cloudfront.net/sa/6e/68... | [Aina Palm] | [BL] | 18,534,129 views | 132,648 subscribers | 1,851,428 likes | https://d30womf5coomej.cloudfront.net/sa/49/92... | There is only one guy Souta absolutely can't s... | [#romance, #drama, #comedy, #Soccer, #sports, ... | 233 | Sep 02, 2020 |
| 3 | Jamie | 110007 | https://tapas.io/series/Jamie/info | https://d30womf5coomej.cloudfront.net/sa/a9/05... | [Bre Indigo, Tami] | [LGBTQ+, Drama, Slice of life] | 14,263,432 views | 126,167 subscribers | 1,452,529 likes | https://d30womf5coomej.cloudfront.net/sa/94/ce... | [ Coming of Age \| LGBTQ+ \| Young Adult ]\n\nAt... | [#friendship, #queer, #crush, #lgbt, #lgbtq, #... | 138 | Jun 04, 2020 |
| 4 | FANGS | 155459 | https://tapas.io/series/fangscomic/info | https://d30womf5coomej.cloudfront.net/sa/18/a9... | [Sarah Andersen] | [Romance, Comedy] | 36,237,626 views | 160,308 subscribers | 1,343,073 likes | 6 | Vamp is three hundred years old but in all tha... | [] | 78 | Oct 31, 2019 |


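Some columns still need processing, such as **views**, **subscribers** and **likes**, which are stored as text with a unit suffix. The **banner** value of FANGS (`6`) also shows that the banner extraction does not work for every series. For now, let's save this data to a file!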

In [28]:
df.to_csv('data.csv', index=False)
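As a starting point for the processing mentioned above, the sketch below converts the count strings into integers and removes repeated tags, which appear in the sample output for **My Gentle Giant**. Both helper names are hypothetical, and the sketch assumes every count follows the `'<number with commas> <unit>'` pattern seen in the table.

```python
def parse_count(text):
    """Turn a string such as '29,109,978 views' into the integer 29109978."""
    number = text.split()[0]  # drop the trailing unit word
    return int(number.replace(',', ''))


def dedupe_tags(tags):
    """Remove repeated tags while keeping their first-seen order."""
    return list(dict.fromkeys(tags))


print(parse_count('29,109,978 views'))
# 29109978
print(dedupe_tags(['#gay', '#soft', '#comedy', '#gay', '#soft']))
# ['#gay', '#soft', '#comedy']
```

With the DataFrame in memory, these could be applied per column, for example `df['views'] = df['views'].map(parse_count)`.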

### **Contact**

If you have any questions or suggestions, send me an email at victor.soeiro.araujo@gmail.com