## Simple crawler process (SOP)
1. Analyze the website: determine whether the pages are static or dynamic, using the browser's developer tools (Google Chrome Developer Tools).
2. Turn pages: observe how the URL changes from one page to the next, then build a loop over that pattern.
3. Send requests: pretend to be a browser by writing suitable request headers, copied from the developer tools.
4. Get the data.
5. Clean the data.
6. Analyze the data.

## Think

### Focus on URL
- The URL points to the content we want to obtain.
- Some examples:
  1. Simple page turning:
     - https://5a1ab4b16b638d40d0999.bi47.cc/read/119046/**1**.html
     - https://5a1ab4b16b638d40d0999.bi47.cc/read/119046/**1_2**.html
     - Find the **pattern** and build a loop to **turn pages**
  2. Complex web pages
     - The URL does not change when we turn the page, for example: https://item.jd.com/100119813751.html#comment
     - So we use the browser's developer tools to find the real request URL: **Inspect > Network > Headers > Request URL**
     - https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1730807463289&body= **(the rest of the URL is omitted here)**
     
     - This URL contains many `%` escape sequences and shows no obvious pattern. This is URL encoding (percent-encoding) of special characters. It is an encoding, not encryption, so it can be decoded directly.
       
     - Then we open **Payload > View URL-decoded**, where we can read the **body**:
       
     - `{"productId":100119813751,"score":0,"sortType":5,"page":2,"pageSize":10,"isShadowSku":0,"rid":0,"fold":1,"bbtf":"","shield":""}`
     - The fields of the body can be passed as ordinary query parameters, so we can construct the URLs ourselves (the examples below were captured for a different product, so `productId` differs):
     - `url_1='https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1684832645932&loginType=3&uuid=122270672.2081861737.1683857907.1684829964.1684832583.3&productId=100009464799&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1&bbtf=1&shield='`
     - `url_2='https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1684832645932&loginType=3&uuid=122270672.2081861737.1683857907.1684829964.1684832583.3&productId=100009464799&score=0&sortType=5&page=2&pageSize=10&isShadowSku=0&rid=0&fold=1&bbtf=1&shield='`
     - Only the `page` parameter differs between the two, so we can turn pages by looping over it.
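The decoding step and the page loop described above can be sketched with Python's standard `urllib.parse` module. This is a minimal sketch, not the site's documented API: `decode_body` and `with_page` are hypothetical helper names, `TEMPLATE_URL` is the `url_1` from the example above, and the code assumes the server keeps accepting the same query parameters when only `page` changes.

```python
import json
from urllib.parse import parse_qs, parse_qsl, urlencode, urlsplit

# url_1 from the example above, used as a template (assumption: only `page` changes).
TEMPLATE_URL = (
    "https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments"
    "&client=pc&clientVersion=1.0.0&t=1684832645932&loginType=3"
    "&uuid=122270672.2081861737.1683857907.1684829964.1684832583.3"
    "&productId=100009464799&score=0&sortType=5&page=1&pageSize=10"
    "&isShadowSku=0&rid=0&fold=1&bbtf=1&shield="
)


def decode_body(url):
    """Decode the percent-encoded JSON stored in the `body` query parameter."""
    query = parse_qs(urlsplit(url).query)  # parse_qs undoes the %XX escapes
    return json.loads(query["body"][0])


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as `shield=`
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    pairs = [(k, str(page) if k == "page" else v) for k, v in pairs]
    return parts._replace(query=urlencode(pairs)).geturl()


# Build the URLs for pages 1 to 3 in a loop.
page_urls = [with_page(TEMPLATE_URL, page) for page in range(1, 4)]
for page_url in page_urls:
    print(page_url)

# Round trip: percent-encode a small made-up body, then decode it again.
encoded = "https://api.m.jd.com/?" + urlencode(
    {"body": json.dumps({"page": 2, "pageSize": 10})}
)
print(decode_body(encoded))
```

Editing one parameter of a captured URL, rather than rebuilding the query from scratch, avoids guessing which of the other parameters the server actually requires.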
       
### About the header
- Purpose: pretend to be a browser.
- How to build it: in the developer tools, find the **Request Headers** section of the Headers tab and copy the key fields, such as **User-Agent**, **Referer**, and **Cookie**.
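As an illustration of the header described above, here is a minimal sketch of a helper that assembles browser-like headers. The name `build_headers` and the `JD_COOKIE` environment variable are assumptions made for this example; the User-Agent string and Referer are the ones used in the code section of this document. Reading the cookie from the environment simply keeps it out of the source file.

```python
import os

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
)


def build_headers(cookie=None, referer="https://item.jd.com/"):
    """Return request headers that make the request look like a browser's."""
    if cookie is None:
        # JD_COOKIE is a hypothetical variable name; store your own cookie there.
        cookie = os.environ.get("JD_COOKIE", "")
    headers = {"User-Agent": USER_AGENT, "Referer": referer}
    if cookie:
        headers["Cookie"] = cookie  # only send a cookie when we actually have one
    return headers


headers = build_headers(cookie="your cookie")
print(sorted(headers))
```

The resulting dictionary can be passed as the `headers=` argument of `requests.get`.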

### About the response code
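- The status code tells us how the request went: `200` means success, `3xx` is a redirect, `4xx` is a client-side error (for example `403` Forbidden or `404` Not Found), and `5xx` is a server-side error.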
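To make the status code classes concrete, the following sketch maps a code to its standard reason phrase and its class using the standard library's `http.HTTPStatus`. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable description of an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase  # e.g. 404 -> "Not Found"
    except ValueError:
        phrase = "Unknown"  # code is not defined in the standard library enum
    if 100 <= code < 200:
        category = "informational"
    elif 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "non-standard"
    return f"{code} {phrase} ({category})"


for code in (200, 302, 403, 404, 500):
    print(describe_status(code))
```

For a crawler, a `403` often suggests that the headers or cookie were rejected, while a `5xx` is usually worth retrying later.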

## Code

#### 1. Get web page data

In [19]:
import requests
import json
import pandas as pd

In [31]:
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    'Referer': 'https://item.jd.com/',
    'Cookie': 'your cookie',  # paste the cookie copied from the developer tools
}

url='https://api.m.jd.com/?appid=item-v3&functionId=pc_club_productPageComments&client=pc&clientVersion=1.0.0&t=1684832645932&loginType=3&uuid=122270672.2081861737.1683857907.1684829964.1684832583.3&productId=100009464799&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1&bbtf=1&shield='
response = requests.get(url=url, headers=header, timeout=10)  # timeout so a stalled request does not hang
if response.status_code == 200:
    data = response.json()  # parse the JSON body into a dict
else:
    print('Request failed, status code:', response.status_code)
# The list of comments is stored under data['comments']

In [33]:
data  # a dictionary

{'jwotestProduct': None,
 'score': 0,
 'comments': [{'id': 21634368046,
   'guid': 'a77fcbefcf2d10aecae3df142e40b612',
   'content': '笔记本收到了，孩子还在试用中，不是太习惯用这个苹果系统，装软件不是太顺利，这次购买价格总体很满意，优惠的不少，京东服务很好，可以当场开箱验货，有需要还会来光顾的，五星好评！',
   'creationTime': '2024-10-25 12:37:20',
   'isDelete': False,
   'isTop': False,
   'userImageUrl': 'storage.360buyimg.com/i.imageUpload/31333333333833343136315f7031373136363837363035393237_sma.jpg',
   'topped': 0,
   'replyCount': 0,
   'score': 5,
   'imageStatus': 1,
   'usefulVoteCount': 5,
   'userClient': 4,
   'discussionId': 1627983852,
   'imageCount': 2,
   'anonymousFlag': 1,
   'plusAvailable': 201,
   'mobileVersion': '13.6.0',
   'images': [{'id': -1495229913,
     'imgUrl': '//img30.360buyimg.com/n0/s128x96_jfs/t1/180818/6/51018/74872/671b207fFb9497e75/b38e87684a475dfb.jpg',
     'imgTitle': '',
     'status': 0},
    {'id': -1495229912,
     'imgUrl': '//img30.360buyimg.com/n0/s128x96_jfs/t1/149140/33/47412/55151/671b2080Fe84f92ed/f7ad0c1a2d8c31e9.

#### 2. Extract the data we want

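In [27]:
data_comments = data['comments']  # list of comment dicts
items = []
for t in data_comments:
    content = t['content']          # comment text
    created_at = t['creationTime']  # time the comment was posted
    items.append([created_at, content])

# Column names: 发布时间 = "posted at", 评论内容 = "comment text"
df = pd.DataFrame(items, columns=['发布时间', '评论内容'])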
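The two cells above handle a single page. A hedged sketch of the loop over several pages follows. `extract_rows`, `crawl_pages`, and `fake_fetch` are hypothetical names; the fetch step is passed in as a function so the loop can be tried without network access, and stopping at the first empty page is an assumption about how the API signals the last page.

```python
import time


def extract_rows(data):
    """Turn one parsed response (a dict) into [creationTime, content] rows."""
    rows = []
    for comment in (data or {}).get("comments") or []:
        rows.append([comment.get("creationTime"), comment.get("content")])
    return rows


def crawl_pages(fetch_page, pages, delay=1.0):
    """Call fetch_page(page) for each page number and collect all rows."""
    all_rows = []
    for page in pages:
        rows = extract_rows(fetch_page(page))
        if not rows:  # assumption: an empty page means there is nothing left
            break
        all_rows.extend(rows)
        time.sleep(delay)  # pause between requests to avoid hammering the server
    return all_rows


def fake_fetch(page):
    """Stand-in for a real request; returns made-up data for pages 1 and 2 only."""
    if page > 2:
        return {"comments": []}
    return {
        "comments": [
            {
                "creationTime": f"2024-01-0{page} 12:00:00",
                "content": f"sample comment {page}",
            }
        ]
    }


rows = crawl_pages(fake_fetch, range(1, 6), delay=0)
print(rows)
```

In real use, `fetch_page` would wrap the `requests.get` call from step 1 and return `response.json()`, and the collected rows could be passed to `pd.DataFrame` exactly as in the cell above.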

In [29]:
df

| | 发布时间 | 评论内容 |
|---|---|---|
| 0 | 2024-10-25 12:37:20 | 笔记本收到了，孩子还在试用中，不是太习惯用这个苹果系统，装软件不是太顺利，这次购买价格总体很... |
| 1 | 2024-10-17 18:06:50 | 趁政府补贴力度大赶紧下手换了已经用了8年的老电脑 很划算 查了也是全新机 送达速度也超快 下... |
| 2 | 2024-10-15 09:13:39 | 苹果一如既往的高级和流畅，手机-平板，到现在购笔记本，都是那么喜欢！序列号验证过是没激活过，... |
| 3 | 2024-10-14 22:08:10 | 在网上关注很久了，让朋友给我领取了，政府补贴下单后非常的优惠，不愧是苹果运行速度非常好屏幕看... |
| 4 | 2024-10-22 00:18:27 | 运行速度：运行速度很快很流畅完全不卡顿\n屏幕效果：屏幕效果秒杀同价位笔记本电脑，非常清晰色... |
| 5 | 2024-10-28 00:55:28 | 非常好！！苹果电脑颜值太高咯真的很喜欢啊！！\n志有什么问题问客服都很耐心的解答！\n很轻薄... |
| 6 | 2021-07-02 12:12:00 | 收到了，孩子说还得买手机 |
| 7 | 2024-10-25 23:21:52 | 这款手机不仅拥有出色的性能，而且设计精美，用户体验极佳。它的处理器速度快，运行流畅，无论是玩... |
| 8 | 2024-10-18 16:24:00 | 这个是真的很好用啊，大品牌，值得信赖的品牌形象设计师和服务团队的核心竞争能力和服务团队实力得... |
| 9 | 2024-11-04 22:03:33 | 运行速度：很快，不愧是水果平板\n屏幕效果：很清晰\n散热性能：很棒\n外形外观：太好看了，... |
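Step 5 of the process (cleaning) is not shown in this notebook. As a small sketch of what it might involve for this data: the comment text in the output above contains `\n` line breaks, and the timestamps are plain strings. `clean_comment` and `parse_time` are hypothetical helper names, and the sample row is made up for illustration.

```python
import re
from datetime import datetime


def clean_comment(text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def parse_time(value):
    """Parse a timestamp in the 'YYYY-MM-DD HH:MM:SS' format used by creationTime."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")


# Made-up sample row in the same [time, content] shape as `items` above.
sample_rows = [["2024-11-04 22:03:33", "Speed: fast\nScreen: clear\n"]]
cleaned_rows = [[parse_time(t), clean_comment(c)] for t, c in sample_rows]
print(cleaned_rows)
```

Parsed timestamps make it possible to sort or filter comments by date in the analysis step.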
