## 1. Get the webpage using *requests*

In [6]:
import requests

#req = requests.get('https://portal.gdc.cancer.gov/projects/TCGA-LIHC')
req = requests.get('https://en.wikipedia.org/wiki/The_Cancer_Genome_Atlas')
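
The request above has no timeout and does not check whether the server returned an error. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the decoded body of `url` (hypothetical helper for illustration)."""
    # A timeout keeps the call from waiting indefinitely on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text


# Usage (needs network access, so it is left commented out here):
# webpage = fetch_html('https://en.wikipedia.org/wiki/The_Cancer_Genome_Atlas')
```

Raising on a failed request means a 404 or 500 error page is not silently parsed as if it were the article.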

In [7]:
req

<Response [200]>

In [8]:
webpage = req.text

In [9]:
# Save the raw HTML to disk ("filename" is a placeholder path)
with open("filename", "wb") as f:
    f.write(webpage.encode('utf-8'))
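
Writing in text mode with an explicit encoding is a common alternative to encoding by hand. The sketch below assumes a short stand-in string instead of the real `webpage`, and a hypothetical file name `tcga_wikipedia.html` inside a temporary directory.

```python
import tempfile
from pathlib import Path

# Stand-in for the `webpage` string (assumed sample content, no newlines)
sample_html = "<html><head><title>Démo</title></head><body></body></html>"

# Hypothetical output path inside a fresh temporary directory
out_path = Path(tempfile.mkdtemp()) / "tcga_wikipedia.html"

# Text mode with encoding="utf-8" encodes the string while writing
# (on Windows, text mode may also translate newline characters)
out_path.write_text(sample_html, encoding="utf-8")

# Read the file back to confirm that nothing was lost
restored = out_path.read_text(encoding="utf-8")
print(restored == sample_html)
```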

In [10]:
print(webpage)

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-typography-survey-disabled vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>The Cancer Genome Atlas - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu

## 2. Get specific contents using BeautifulSoup

In [11]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'html.parser')
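
To see what the parsed object offers before working on the full page, here is a small sketch on a stand-in document. The HTML string and the `sample_` names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (assumed sample, not the Wikipedia page)
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello</p></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")

# The first tag of a given name can be reached as an attribute of the soup
page_title = sample_soup.title.string     # text inside <title>
first_p_text = sample_soup.p.get_text()   # text inside the first <p>
print(page_title, "|", first_p_text)
```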

### 2.1 Prettify the webpage

In [12]:
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-zebra-design-enabled vector-feature-custom-font-size-clientpref-0 vector-feature-client-preferences-disabled vector-feature-client-prefs-pinned-disabled vector-feature-typography-survey-disabled vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   The Cancer Genome Atlas - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vecto

### 2.2 Get the first paragraph

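`find_all` returns a list of matching tags, while `find` returns only the first match. Try removing the `attrs` argument from the `find` call to see how the result changes.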

In [23]:
first_three_paragraph = soup.find_all('p', limit=3)

In [24]:
first_three_paragraph

[<p><b>The Cancer Genome Atlas</b> (<b>TCGA</b>) is a project to catalogue the <a class="mw-redirect" href="/wiki/Genetic_mutation" title="Genetic mutation">genetic mutations</a> responsible for <a href="/wiki/Cancer" title="Cancer">cancer</a> using <a class="mw-redirect" href="/wiki/Genome_sequencing" title="Genome sequencing">genome sequencing</a> and <a href="/wiki/Bioinformatics" title="Bioinformatics">bioinformatics</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> The overarching goal was to apply <a href="/wiki/Multiplex_(assay)" title="Multiplex (assay)">high-throughput genome analysis techniques</a> to improve the ability to diagnose, treat, and prevent cancer through a better understanding of the genetic basis of the disease.
 </p>,
 <p>TCGA was supervised by the <a href="/wiki/National_Cancer_Institute" title="National Cancer Institute">National Cancer Institute</a>'s <a 

In [25]:
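# Despite the variable name, find() returns a single tag:
# the first <p> that has no class attribute
first_three_paragraph = soup.find('p', attrs={"class":False})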
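
The difference between `find_all`, `find`, and the `attrs={"class": False}` filter is easier to see on a small document. The sketch below uses an assumed three-paragraph sample, not the Wikipedia page.

```python
from bs4 import BeautifulSoup

# Assumed sample: one paragraph with a class, two without
sample_html = """
<div>
  <p class="note">Paragraph with a class</p>
  <p>First plain paragraph</p>
  <p>Second plain paragraph</p>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all returns a list-like ResultSet, capped here at two tags
many = sample_soup.find_all("p", limit=2)

# find returns a single tag; {"class": False} keeps only tags with no class attribute
with_filter = sample_soup.find("p", attrs={"class": False})

# Without the filter, the first <p> in document order is returned
without_filter = sample_soup.find("p")

# When nothing matches, find returns None rather than raising
missing = sample_soup.find("table")

print(len(many), with_filter.get_text(), without_filter.get_text(), missing)
```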

In [26]:
first_three_paragraph

<p><b>The Cancer Genome Atlas</b> (<b>TCGA</b>) is a project to catalogue the <a class="mw-redirect" href="/wiki/Genetic_mutation" title="Genetic mutation">genetic mutations</a> responsible for <a href="/wiki/Cancer" title="Cancer">cancer</a> using <a class="mw-redirect" href="/wiki/Genome_sequencing" title="Genome sequencing">genome sequencing</a> and <a href="/wiki/Bioinformatics" title="Bioinformatics">bioinformatics</a>.<sup class="reference" id="cite_ref-1"><a href="#cite_note-1">[1]</a></sup><sup class="reference" id="cite_ref-2"><a href="#cite_note-2">[2]</a></sup> The overarching goal was to apply <a href="/wiki/Multiplex_(assay)" title="Multiplex (assay)">high-throughput genome analysis techniques</a> to improve the ability to diagnose, treat, and prevent cancer through a better understanding of the genetic basis of the disease.
</p>

### 2.3 Get all the links in this paragraph that point to other webpages

In [27]:
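if first_three_paragraph is not None:
    # Collect every <a> tag in the paragraph and pull out its href attribute
    links = first_three_paragraph.find_all('a')
    urls = [link.get('href') for link in links]
else:
    # find() returned None, so there is nothing to extract
    urls = []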

In [28]:
first_three_paragraph.find_all('a', attrs={"title":True})

[<a class="mw-redirect" href="/wiki/Genetic_mutation" title="Genetic mutation">genetic mutations</a>,
 <a href="/wiki/Cancer" title="Cancer">cancer</a>,
 <a class="mw-redirect" href="/wiki/Genome_sequencing" title="Genome sequencing">genome sequencing</a>,
 <a href="/wiki/Bioinformatics" title="Bioinformatics">bioinformatics</a>,
 <a href="/wiki/Multiplex_(assay)" title="Multiplex (assay)">high-throughput genome analysis techniques</a>]

In [29]:
data = {"title":[], "href":[]}
for link in first_three_paragraph.find_all('a', attrs={"title":True}):
    data["title"].append(link["title"])
    data["href"].append(link["href"])
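
The `href` values collected above are relative to the site root (for example `/wiki/Cancer`), so they cannot be opened directly. One way to turn them into full URLs is `urllib.parse.urljoin` from the standard library. The sketch assumes the base is the page URL requested in section 1, and the name `relative_hrefs` is illustrative.

```python
from urllib.parse import urljoin

# Base URL: the page the links were scraped from
page_url = "https://en.wikipedia.org/wiki/The_Cancer_Genome_Atlas"

# A few href values of the kind found in the paragraph
relative_hrefs = ["/wiki/Genetic_mutation", "/wiki/Cancer", "#cite_note-1"]

# urljoin resolves each href against the base, as a browser would
absolute_urls = [urljoin(page_url, href) for href in relative_hrefs]

for url in absolute_urls:
    print(url)
```

Fragment-only links such as `#cite_note-1` resolve to the same page, which is one reason the notebook filters on the `title` attribute to keep only links to other articles.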

In [30]:
import pandas as pd
df = pd.DataFrame(data)

In [31]:
df

| | title | href |
|---|---|---|
| 0 | Genetic mutation | /wiki/Genetic_mutation |
| 1 | Cancer | /wiki/Cancer |
| 2 | Genome sequencing | /wiki/Genome_sequencing |
| 3 | Bioinformatics | /wiki/Bioinformatics |
| 4 | Multiplex (assay) | /wiki/Multiplex_(assay) |


In [32]:
# index=False leaves the row index (0, 1, 2, ...) out of the file
df.to_csv('The_Cancer_Genome_Atlas.csv', index=False)
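
The effect of `index=False` shows up when the CSV is read back. The sketch below uses an assumed one-row frame and in-memory text instead of a file on disk.

```python
import io

import pandas as pd

# Assumed one-row sample with the same columns as the notebook's DataFrame
sample = pd.DataFrame({"title": ["Cancer"], "href": ["/wiki/Cancer"]})

# With no path argument, to_csv returns the CSV text instead of writing a file
without_index = sample.to_csv(index=False)
with_index = sample.to_csv()

# Read both variants back from memory
restored = pd.read_csv(io.StringIO(without_index))
restored_with_index = pd.read_csv(io.StringIO(with_index))

print(list(restored.columns))
print(list(restored_with_index.columns))
```

When the index is written, its header cell is empty, so `read_csv` names that column `Unnamed: 0`. Passing `index=False` avoids the extra column.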
