reconstruction
fix #36
fix #37
fix #38
wtto00 committed Feb 15, 2023
1 parent 5715c90 commit ca80edd
Showing 11 changed files with 297 additions and 240 deletions.
5 changes: 3 additions & 2 deletions .eslintrc.json
@@ -18,7 +18,8 @@
},
"plugins": ["import", "@typescript-eslint"],
"rules": {
"import/extensions": ["error", "ignorePackages", { "js": "never" }],
"max-len": ["error", { "code": 120 }]
"import/extensions": ["error", "ignorePackages", { "js": "never", "ts": "never" }],
"max-len": ["error", { "code": 120 }],
"@typescript-eslint/no-explicit-any": ["off"]
}
}
3 changes: 1 addition & 2 deletions .vscode/launch.json
@@ -20,8 +20,7 @@
"request": "launch",
"console": "integratedTerminal",
"internalConsoleOptions": "neverOpen",
"disableOptimisticBPs": true,
"cwd": "/home/wtto/projects/github/wtto00/spider-crawler",
"cwd": "/home/wtto/projects/github/wtto00/node-spider-crawler",
"runtimeExecutable": "npm",
"args": [
"run",
3 changes: 2 additions & 1 deletion .vscode/settings.json
@@ -10,5 +10,6 @@
"jest.autoRun": { "watch": false, "onStartup": ["all-tests"] },
"jest.showCoverageOnLoad": true,
"jest.jestCommandLine": "npm run test --",
"jest.rootPath": "."
"jest.rootPath": ".",
"liveServer.settings.port": 5501
}
52 changes: 23 additions & 29 deletions README.md
@@ -104,25 +104,10 @@ const res = crawlFromHtml(options);

## CrawlFromJson Options

| Field | Type                    | Notes                                 |
| ----- | ----------------------- | ------------------------------------- |
| json  | string                  | JSON string                           |
| rules | [JsonRules](#JsonRules) | Value extraction and processing rules |

### JsonRules

```typescript
type JsonRules = Record<string, JsonRule>;
```

#### JsonRule

| Field    | Type                  | Required | Notes                                                              |
| -------- | --------------------- | -------- | ------------------------------------------------------------------ |
| selector | string || [JSON selector rules](https://www.lodashjs.com/docs/lodash.at) |
| handlers | [Handler](#Handler)[] || Collection of data-processing methods |

For a JsonRule, the Handler's Method can only be `prefix`, `substring`, `replace`, `trim`, `number`, or `br2nl`. Any other method is invalid.
| Field | Type            | Notes                                 |
| ----- | --------------- | ------------------------------------- |
| json  | string          | JSON string                           |
| rules | [Rules](#Rules) | Value extraction and processing rules |
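
As a rough illustration of what a JSON `selector` does, the sketch below parses a JSON string and looks up a dot-separated path. `pickPath` is a hypothetical helper written for this example and is not part of the library. The real selector syntax follows lodash's `at`, which also accepts bracket notation that this simplified sketch does not handle.

```typescript
// Hypothetical helper: resolves a dot-separated path such as "a.b" against parsed JSON.
// Assumption: only plain dot paths are supported; lodash-style brackets are not.
function pickPath(source: unknown, path: string): unknown {
  let current: unknown = source;
  for (const key of path.split('.')) {
    // Stop early when there is nothing left to descend into.
    if (current === null || typeof current !== 'object') return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Made-up sample document, passed as a string like the `json` option.
const json = JSON.stringify({ a: { b: 'hello' }, list: [10, 20] });
const parsed: unknown = JSON.parse(json);

console.log(pickPath(parsed, 'a.b'));
console.log(pickPath(parsed, 'list.1'));
console.log(pickPath(parsed, 'a.missing.deeper'));
```

A path that runs off the end of the document yields `undefined` rather than throwing, which is how this sketch chooses to treat missing values.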

## CrawlFromHtml Options

@@ -132,7 +117,7 @@ For a JsonRule, the Handler's Method can only be `prefix`,`substring`,`replace`,`trim
| html | string || HTML string |
| rules | [Rules](#Rules) || Value extraction and processing rules |

## Options
## CrawlFromUrl Options

| Field | Type | Notes |
| ------- | --------------------------------------------------------------- | -------- |
@@ -148,10 +133,11 @@ type Rules = Record<string, Rule>;

#### Rule

| Field    | Type                  | Required | Notes |
| -------- | --------------------- | -------- | ----- |
| selector | string || [cheerio selector](https://github.com/cheeriojs/cheerio/wiki/Chinese-README#%E9%80%89%E6%8B%A9%E5%99%A8) |
| handlers | [Handler](#Handler)[] || Collection of methods for processing the crawled elements |
| Field    | Type                  | Required | Notes |
| -------- | --------------------- | -------- | ----- |
| selector | string || [cheerio selector](https://github.com/cheeriojs/cheerio/wiki/Chinese-README#%E9%80%89%E6%8B%A9%E5%99%A8) |
| dataType | 'html'\|'json' || Whether `selector` is a [cheerio selector](https://github.com/cheeriojs/cheerio/wiki/Chinese-README#%E9%80%89%E6%8B%A9%E5%99%A8) or a [JSON selector](https://www.lodashjs.com/docs/lodash.at) |
| handlers | [Handler](#Handler)[] || Collection of methods for processing the crawled elements |
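
To make the table concrete, here is a minimal type sketch of a rule set that mixes both `dataType` values. The interfaces are reconstructed from the tables in this README and are assumptions: which fields are optional is a guess, and the selectors `h1.title` and `data.count` are made-up examples.

```typescript
// Assumed shapes, reconstructed from the README tables; the library's real types may differ.
interface Handler {
  method: string;
  args?: unknown[];
}

interface Rule {
  selector: string;
  // Assumption: dataType is optional. The default when it is omitted is not stated here.
  dataType?: 'html' | 'json';
  handlers?: Handler[];
}

type Rules = Record<string, Rule>;

const rules: Rules = {
  // A cheerio selector paired with the `trim` handler.
  title: { selector: 'h1.title', dataType: 'html', handlers: [{ method: 'trim' }] },
  // A JSON path selector paired with the `number` handler.
  count: { selector: 'data.count', dataType: 'json', handlers: [{ method: 'number' }] },
};

console.log(Object.keys(rules));
```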

#### Handler

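@@ -167,7 +153,7 @@ interface Handler {
All available methods and their corresponding arguments are listed below.

- **prefix**
Prepend a string
Prepend a string to the beginning of the string
`args: [string]`
- **substring**
Extract a substring from the string result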
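@@ -178,16 +164,19 @@ interface Handler {
- **trim**
Remove leading and trailing whitespace
No `args` required
- **resolveUrl**
Resolve the obtained path against the current request URL
No `args` required
- **number**
Convert the string to a number
No `args` required
- **br2nl**
Replace `br` tags in the `html` with the newline character `\n`
Matches `<br />`, `<br/>`, `<br >`, and `<br>`, along with any surrounding whitespace and `\n` newlines
No `args` required
- **sum**
Convert an array of strings to numbers and add them up
No `args` required
- **resolveUrl**
Resolve the obtained path against the current request URL
No `args` required
- **decode**
Decode an HTML string into normal readable text
No `args` required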
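@@ -214,8 +203,13 @@ interface Handler {
No `args` required
- **map**
[cheerio method](https://github.com/cheeriojs/cheerio/wiki/Chinese-README#map-functionindex-element--1)
Passes each element in the matched set through a function, producing a new cheerio object containing the return values
Passes each element in the matched set through a function, producing a new array containing the return values
The function can return a single data item or an array of data items to be inserted into the resulting set.
If an array is returned, its elements are inserted into the set.
If the function returns null or undefined, no element is inserted.
`args: [Rules]`
- **each**
[cheerio method](https://github.com/cheeriojs/cheerio/wiki/Chinese-README#each-functionindex-element-)
Iterates over a cheerio object, applying some processing to each element, and produces a new array.
Unlike `map`, which always returns an array of objects, `each` does not necessarily return an array of objects.
`args: [Handler[]]`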
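
The behaviour of a few of the simpler handlers can be sketched as plain functions. These are illustrative stand-ins, not the library's code: the function names and signatures are assumptions, the `br2nl` regular expression is one possible reading of the description above, and `resolveUrl` here relies on the standard `URL` constructor.

```typescript
// Illustrative stand-ins for a few handlers; the bodies are assumptions, not library code.

// prefix: prepend a fixed string to the value.
const prefix = (value: string, head: string): string => head + value;

// br2nl: turn <br>, <br/>, <br />, <br > (plus surrounding whitespace) into "\n".
const br2nl = (html: string): string => html.replace(/\s*<br\s*\/?>\s*/gi, '\n');

// resolveUrl: resolve a possibly relative path against the request URL.
const resolveUrl = (path: string, requestUrl: string): string => new URL(path, requestUrl).href;

// number: convert a numeric string to a number.
const toNumber = (value: string): number => Number(value);

console.log(prefix('world', 'hello '));
console.log(JSON.stringify(br2nl('line one<br />\nline two<br>line three')));
console.log(resolveUrl('/wtto00/badge-test/issues', 'https://gitee.com/explore'));
console.log(toNumber(' 42 '));
```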
25 changes: 9 additions & 16 deletions debugger.ts
@@ -1,22 +1,15 @@
import type { CrawlerJsonOptions } from './src/types.js';
import { crawlFromJson } from './src/index.js';
import { crawlFromUrl, CrawlerUrlOptions } from './src/index.js';

const jsonOptions: CrawlerJsonOptions = {
json: JSON.stringify({}),
const options: CrawlerUrlOptions = {
url: 'https://gitee.com/wtto00/badge-test/issues',
rules: {
// pickUndefined: {
// selector: 'a.b',
// },
// selectorEmpty: {
// selector: '',
// },
quotePick: {
selector: "a['\nb']",
total: {
selector: '#git-issues-filters a.item div.label',
handlers: [{ method: 'each', args: [[{ method: 'number' }]] }, { method: 'sum' }],
},
},
};

crawlFromJson(jsonOptions);
// .then((res) => {
// console.log(JSON.stringify(res));
// });
crawlFromUrl(options).then((res) => {
console.log(res.data);
});
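
The `total` rule chains two handlers: `each` runs `number` on every matched label, and `sum` then adds the results. A self-contained sketch of that data flow, using made-up label texts in place of a real page and plain array methods in place of the library's handlers, might look like this:

```typescript
// Made-up sample data standing in for the text of each matched `div.label`.
const labelTexts: string[] = ['12', '3', '0'];

// Stand-in for { method: 'each', args: [[{ method: 'number' }]] }:
// apply the `number` conversion to every element.
const asNumbers: number[] = labelTexts.map((text) => Number(text));

// Stand-in for { method: 'sum' }: add the converted values together.
const total: number = asNumbers.reduce((acc, n) => acc + n, 0);

console.log(total);
```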