# Initialization

In [1]:
# Initialize. Set the appname parameter to a name that contains your username and the dataset used in this session
%run spark_openalex_init.py --appname=MarioZZJ_OpenAlex

Open the Spark UI link printed above in a new tab to follow the progress of your jobs. The Spark UI overview page shows how multiple apps are running in parallel.

# Viewing the schema

At this point we have 27 `pyspark.sql.dataframe.DataFrame` objects, one per table. Call `printSchema()` on any of them to see its fields.

The 27 objects are bound to these variable names: `Authors`, `AuthorsIds`, `AuthorsCountsByYear`, `Concepts`, `ConceptsAncestors`, `ConceptsCountsByYear`, `ConceptsIds`, `ConceptsRelatedConcepts`, `Institutions`, `InstitutionsAssociatedInstitutions`, `InstitutionsCountsByYear`, `InstitutionsGeo`, `InstitutionsIds`, `Works`, `WorksAuthorships`, `WorksAlternateHostVenues`, `WorksBiblio`, `WorksConcepts`, `WorksHostVenues`, `WorksIds`, `WorksMesh`, `WorksOpenAccess`, `WorksRelatedWorks`, `WorksReferencedWorks`, `Venues`, `VenuesCountsByYear`, `VenuesIds`.

The detailed schema of each table is described in the [OpenAlex documentation](https://docs.openalex.org/download-snapshot/upload-to-your-database/load-to-a-relational-database/postgres-schema-diagram).
![schema](https://2520693015-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FSj6S26Opvy3KVj3QQGMc%2Fuploads%2FS3eqe0lsHYmqxJTcPa9V%2Fopenalex-schema.png?alt=media&token=64f070a8-ca96-4639-96d7-06d8a6ea659d)

In [5]:
Authors.printSchema()

root
 |-- id: string (nullable = true)
 |-- orcid: string (nullable = true)
 |-- display_name: string (nullable = true)
 |-- display_name_alternatives: string (nullable = true)
 |-- works_count: integer (nullable = true)
 |-- cited_by_count: integer (nullable = true)
 |-- last_known_institution: string (nullable = true)
 |-- works_api_url: string (nullable = true)
 |-- updated_date: date (nullable = true)
 |-- display_name_alternatives_array: array (nullable = true)
 |    |-- element: string (containsNull = true)



In [6]:
Works

DataFrame[id: string, doi: string, title: string, display_name: string, publication_year: int, publication_date: string, type: string, cited_by_count: int, is_retracted: boolean, is_paratext: boolean, host_venue: string]

# Example: fuzzy search
Similar to `WHERE ... LIKE` in SQL.

In [8]:
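# Find the first 10 works returned whose title contains 'covid' (like() is case-sensitive)
works = Works.filter(
            Works.display_name.like('%covid%') # filter() takes a column expression: use `==` for an exact match or like() for a pattern match
        ).limit(10).toPandas() # use limit() generously while exploring to reduce resource usage
                    # toPandas() converts the result to a pandas DataFrame, which the notebook renders directly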
works

Unnamed: 0,id,doi,title,display_name,publication_year,publication_date,type,cited_by_count,is_retracted,is_paratext,host_venue
0,https://openalex.org/W4213007532,https://doi.org/10.15446/ts.v24n1.94179,"Concepciones de salud, Sistema de salud públic...","Concepciones de salud, Sistema de salud públic...",2022,2022-01-03,journal-article,0,False,,https://api.openalex.org/works?filter=cites:W4...
1,https://openalex.org/W4225105503,https://doi.org/10.1136/bmj.o867,The benefits of large scale covid-19 vaccination.,The benefits of large scale covid-19 vaccination.,2022,2022-04-27,journal-article,0,False,,https://api.openalex.org/works?filter=cites:W4...
2,https://openalex.org/W4225259912,https://doi.org/10.1136/bmj.o1096,The government wants us to learn to live with ...,The government wants us to learn to live with ...,2022,2022-04-29,journal-article,0,False,,https://api.openalex.org/works?filter=cites:W4...
3,https://openalex.org/W4225852779,https://doi.org/10.29327/icidsuim20221.461057,Saberes e práticas: primeiro ano de uma residê...,Saberes e práticas: primeiro ano de uma residê...,2022,2022-01-01,book-chapter,0,False,False,https://api.openalex.org/works?filter=cites:W4...
4,https://openalex.org/W4206974766,https://doi.org/10.14393/ufu.di.2022.24,A atuação da UFU frente a pandemia de covid-19...,A atuação da UFU frente a pandemia de covid-19...,2022,2022-01-24,dissertation,0,False,False,https://api.openalex.org/works?filter=cites:W4...
5,https://openalex.org/W4200272339,https://doi.org/10.1080/13527266.2021.2012499,Advertising content and online engagement on s...,Advertising content and online engagement on s...,2021,2021-12-26,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W4...
6,https://openalex.org/W4226093403,https://doi.org/10.1080/0142159x.2022.2027902,The need to preserve the humanistic nature of ...,The need to preserve the humanistic nature of ...,2022,2022-01-20,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W4...
7,https://openalex.org/W4210273141,https://doi.org/10.7476/9786557081587.0027,A experiência do Proqualis na produção e divul...,A experiência do Proqualis na produção e divul...,2022,2022-01-26,book-chapter,0,False,False,https://api.openalex.org/works?filter=cites:W4...
8,https://openalex.org/W4210584131,https://doi.org/10.1016/j.shaw.2021.12.1275,Which jobs are lucky against the “biologic” an...,Which jobs are lucky against the “biologic” an...,2022,2022-01-01,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W4...
9,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...


In [9]:
WorksReferencedWorks

DataFrame[work_id: string, referenced_work_id: string]

# Example: joining tables
Similar to `JOIN` in SQL.

In [11]:
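# Find the first 10 works referenced by work x
selected_citations = Works.filter(Works.id=='https://openalex.org/W3213651567').join(
    WorksReferencedWorks, # first argument: the table to join with
    Works.id == WorksReferencedWorks.work_id, # second argument: the join condition (the ON clause)
    how = 'inner') # third argument: the join type, here an inner join
selected_citations.limit(10).toPandas() # until an action such as toPandas() or show() is called, the variable only holds the query plan and nothing is computed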

Unnamed: 0,id,doi,title,display_name,publication_year,publication_date,type,cited_by_count,is_retracted,is_paratext,host_venue,work_id,referenced_work_id
0,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...,https://openalex.org/W3213651567,https://openalex.org/W1493925022
1,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...,https://openalex.org/W3213651567,https://openalex.org/W1765504387
2,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...,https://openalex.org/W3213651567,https://openalex.org/W3039521750
3,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...,https://openalex.org/W3213651567,https://openalex.org/W3081416251
4,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...,https://openalex.org/W3213651567,https://openalex.org/W3094350895
5,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...,https://openalex.org/W3213651567,https://openalex.org/W3132890237
6,https://openalex.org/W3213651567,https://doi.org/10.48102/pi.v29i3.414,Psicología basada en la evidencia en tiempos d...,Psicología basada en la evidencia en tiempos d...,2021,2021-11-09,journal-article,0,False,False,https://api.openalex.org/works?filter=cites:W3...,https://openalex.org/W3213651567,https://openalex.org/W3172256592


# Example: aggregation
Similar to `GROUP BY` in SQL.

In [15]:
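# Count the distinct MeSH descriptors attached to work x
from pyspark.sql import functions as F # harmless if F was already imported earlier in the session
selected_meshes = WorksMesh.filter(WorksMesh.work_id == 'https://openalex.org/W2513796179')\
                    .groupBy(WorksMesh.work_id).agg(# agg() applies aggregate functions to each group
                    F.countDistinct(WorksMesh.descriptor_ui)# countDistinct counts distinct values; see the pyspark.sql.functions reference for more functions
                    .alias("Mesh_D_Count"))# alias() names the newly generated column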

selected_meshes.toPandas()

Unnamed: 0,work_id,Mesh_D_Count
0,https://openalex.org/W2513796179,7


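# Example: column pruning
Similar to `SELECT` in SQL. Most analyses need only a few fields from each table, and pruning columns as early as possible helps reduce the computational load.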

In [16]:
# Take the first 10 rows of the Works table, keeping only the id and doi fields
works_sub = Works.select(
                Works.id,
                Works.doi
)
works_sub.limit(10).toPandas()

Unnamed: 0,id,doi
0,https://openalex.org/W2973187907,https://doi.org/10.31080/asor.2019.02.0071
1,https://openalex.org/W2513002348,https://doi.org/10.1016/j.bbmt.2015.11.882
2,https://openalex.org/W2946982473,https://doi.org/10.1016/s1569-1993(19)30280-2
3,https://openalex.org/W2960429651,https://doi.org/10.1088/1755-1315/252/5/052020
4,https://openalex.org/W3022436951,
5,https://openalex.org/W3022449202,https://doi.org/10.1016/0020-7292(94)90341-7
6,https://openalex.org/W3022723715,https://doi.org/10.7554/elife.50232.sa1
7,https://openalex.org/W2982244448,https://doi.org/10.1093/eurheartj/ehz745.0735
8,https://openalex.org/W3022891036,https://doi.org/10.35826/ijetsar.37
9,https://openalex.org/W3022072931,https://doi.org/10.1111/an.1239


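# Example: saving data
Sometimes intermediate results need to be saved. If the data is small, call `toPandas()` and save the resulting DataFrame directly with pandas. If the data is large, use Spark's `DataFrame.write` interface instead, which saves the result as multiple files and reduces memory and compute pressure.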

In [17]:
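# Take the first 1000 rows of the Works table, keeping only the id and doi fields, and save them to the given folder
works_sub.limit(1000).write.csv("./data/id_doi",header=True,mode="overwrite") # with header=True every part file includes the header row (recommended, it simplifies reading the files later)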

The `./data/id_doi` folder now contains several files that together hold all the extracted rows.
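
Spark names the data files in such a folder `part-*.csv` and adds an empty `_SUCCESS` marker when the job completes. When the saved result is small, the parts can also be merged without Spark. The sketch below uses only the Python standard library; `read_part_files` is a hypothetical helper, and the demo runs on a temporary folder with made-up file names and rows that imitate this layout, so it assumes the folder is on a local file system.

```python
import csv
import glob
import os
import tempfile


def read_part_files(folder):
    """Read every part-*.csv file in a Spark output folder into one list of dicts."""
    rows = []
    for path in sorted(glob.glob(os.path.join(folder, "part-*.csv"))):
        with open(path, newline="", encoding="utf-8") as handle:
            # Each part file has its own header row because header=True was used when writing.
            rows.extend(csv.DictReader(handle))
    return rows


# Demo on a temporary folder that imitates the layout (names and rows are made up).
with tempfile.TemporaryDirectory() as folder:
    demo_parts = [
        ("part-00000-demo.csv", ["id,doi", "W1,https://doi.org/10.0000/example-a"]),
        ("part-00001-demo.csv", ["id,doi", "W2,https://doi.org/10.0000/example-b"]),
    ]
    for name, lines in demo_parts:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as handle:
            handle.write("\n".join(lines) + "\n")
    # Empty marker file; the part-*.csv glob ignores it.
    open(os.path.join(folder, "_SUCCESS"), "w").close()

    merged = read_part_files(folder)

print(merged)
```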

# Example: reading data
The partitioned files saved above can be loaded back into a `pyspark.sql.dataframe.DataFrame` through Spark's `spark.read` interface for further processing.

In [18]:
works_sub = spark.read.csv("./data/id_doi", header=True, inferSchema=True)
print(type(works_sub))
works_sub.limit(10).toPandas()

<class 'pyspark.sql.dataframe.DataFrame'>


Unnamed: 0,id,doi
0,https://openalex.org/W2105305796,https://doi.org/10.1097/01.ju.0000118480.04204.35
1,https://openalex.org/W2065086217,https://doi.org/10.5261/2014.gen4.08
2,https://openalex.org/W2065099785,https://doi.org/10.1111/j.1525-1438.2007.00785.x
3,https://openalex.org/W2065112245,https://doi.org/10.1016/j.transproceed.2014.08...
4,https://openalex.org/W2065115307,https://doi.org/10.1055/s-2005-870897
5,https://openalex.org/W2065158039,https://doi.org/10.1016/j.jopan.2006.03.001
6,https://openalex.org/W2065347846,https://doi.org/10.1038/433367a
7,https://openalex.org/W2271568448,https://doi.org/10.1111/trf.13514
8,https://openalex.org/W2973075386,https://doi.org/10.1117/12.2532096
9,https://openalex.org/W3213985658,https://doi.org/10.2139/ssrn.3170307


For more usage, see the [PySpark documentation](https://spark.apache.org/docs/3.1.2/api/python/getting_started/index.html).

# Important: release the session

Please make a habit of releasing the resources you hold: run the statements below when you are done. Shutting down the notebook also releases them.

In [19]:
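spark.sparkContext.stop() # stop the SparkContext explicitly
spark.stop() # stop the SparkSession (this also stops the underlying SparkContext)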