# Why Understanding HTML and Markdown Is Important in Web Scraping

## 1. Why You Need to Understand HTML  
HTML is a markup language that defines the structure of web pages, so understanding the HTML structure is essential for accurately extracting data during web scraping.

- Understanding Tags and Hierarchy  
    - Knowing the roles of HTML tags such as `<div>`, `<span>`, `<table>`, `<ul>`, and `<li>` makes it easier to locate the desired data.  
    - You can use attributes such as `id`, `class`, `name`, and `href` to select specific elements.

- Using XPath and CSS Selectors  
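    - BeautifulSoup (Python) finds elements with CSS selectors through `select()`, while rvest (R) accepts both CSS selectors and XPath. For XPath in Python, use a library such as `lxml`.  
    - Example: `soup.select('div.article > p')` (BeautifulSoup) or `html_nodes(doc, "div.article > p")` (rvest; newer versions prefer `html_elements()`)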

- Handling Dynamic Web Pages (Understanding JavaScript Rendering)  
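    - Some websites load their data with JavaScript after the page arrives, so the HTML returned by a plain HTTP request may not contain it. Browser-automation tools such as Selenium or Playwright render the page first and then expose the resulting HTML.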
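
To make the tag hierarchy idea concrete, here is a minimal sketch of what the selector `div.article > p` means: keep only `<p>` elements whose direct parent is a `<div>` with class `article`. It uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra installs; the class name `ArticleParagraphs` and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "input", "hr", "meta", "link", "source"}


class ArticleParagraphs(HTMLParser):
    """Collect the text of <p> elements that are direct children of <div class="article">."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, class list) for every currently open element
        self.paragraphs = []   # finished paragraph texts
        self._buffer = None    # pieces of the paragraph being read, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:   # void elements never get a closing tag
            return
        classes = (dict(attrs).get("class") or "").split()
        parent = self.stack[-1] if self.stack else None
        self.stack.append((tag, classes))
        if tag == "p" and parent and parent[0] == "div" and "article" in parent[1]:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None
        if self.stack and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


sample = """
<div class="article">
  <p>First <strong>direct</strong> paragraph.</p>
  <section><p>Nested paragraph.</p></section>
  <p>Second direct paragraph.</p>
</div>
<p>Outside paragraph.</p>
"""

parser = ArticleParagraphs()
parser.feed(sample)
article_paragraphs = parser.paragraphs
print(article_paragraphs)
```

The nested and outside paragraphs are skipped because their parent is not the `div.article` element, which is exactly the distinction the `>` (child) combinator draws.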

## 2. Why You Need to Understand Markdown  
Markdown is mainly used for documentation, blog posts, and API docs, and you may need to handle it when extracting data from the web.

- Handling Web Pages with Markdown  
    - You may need to retrieve data stored in Markdown format from websites (e.g., GitHub, Jupyter Notebook, blogs) and convert it to HTML.  
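    - Note that `get_text()` in BeautifulSoup returns plain text with every tag removed, so headings, links, and emphasis are lost. If you need Markdown output, convert from the HTML itself rather than from the extracted text.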

- Cleaning and Converting Markdown Data  
    - For example, you may need to crawl text saved in Markdown from a web page and convert it to another format (HTML, LaTeX, etc.).  
    - You can use `markdown2`, `mistune` (Python), or `markdown` (R package) for conversion.
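
As a rough illustration of what such a conversion involves, the sketch below turns a very small Markdown subset (headers, bold, italic, inline links, paragraphs) into HTML with regular expressions. It is a toy written for this note, not how `markdown2` or `mistune` work internally, and the function name `markdown_to_html` is hypothetical; for real data, use one of the libraries above.

```python
import html
import re


def markdown_to_html(text):
    """Convert a tiny Markdown subset to HTML, one line at a time."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Escape &, <, > first so raw text cannot inject HTML.
        line = html.escape(line, quote=False)
        line = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", line)
        line = re.sub(r"\*(.+?)\*", r"<em>\1</em>", line)
        line = re.sub(r"\[([^\]]+)\]\(([^)]+)\)", r'<a href="\2">\1</a>', line)
        header = re.match(r"(#{1,6})\s+(.*)", line)
        if header:
            level = len(header.group(1))
            out.append(f"<h{level}>{header.group(2)}</h{level}>")
        else:
            out.append(f"<p>{line}</p>")
    return "\n".join(out)


source = "# Title\n\nSome **bold** and *italic* text with a [link](https://example.com)."
converted = markdown_to_html(source)
print(converted)
```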

# 1. HTML Key Concepts

## What is HTML?
- **HTML (HyperText Markup Language)** is used to structure content on the web.
- Uses tags to structure content (text, images, links, etc.).
- A basic web page typically includes HTML, CSS, and JavaScript.

## Basic structure
- An HTML document has a root `<html>` element with two children: `<head>` (metadata such as the title) followed by `<body>` (the visible content).

```html
<!DOCTYPE html>
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <!-- Content goes here -->
</body>
</html>
```
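
The nesting of this skeleton can be checked programmatically. The following sketch, which assumes only Python's standard-library `html.parser`, records each opening tag with its depth and picks out the `<title>` text; `OutlineParser` is an illustrative name, not part of any library.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each opening tag indented by its nesting depth, plus the <title> text."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of the currently open elements
        self.outline = []   # indented tag names in document order
        self.title = ""

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * len(self.stack) + tag)
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack and self.stack[-1] == "title":
            self.title += data


page = """<!DOCTYPE html>
<html>
<head>
  <title>Page Title</title>
</head>
<body>
  <!-- Content goes here -->
</body>
</html>"""

parser = OutlineParser()
parser.feed(page)
outline = parser.outline
title = parser.title
print("\n".join(outline))
print(title)
```

The outline shows `head` and `body` at the same depth, both nested one level inside `html`.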

## Common HTML Tags

- `<h1>` to `<h6>`: Headings (`<h1>` is the top level and, by default, the largest)
```html
<h1>Main Title</h1>
<h2>Section Title</h2>
```
- `<p>`: Paragraph
```html
<p>This is a paragraph of text.</p>
```
- `<strong>`: Strong importance (rendered as bold text by default)
```html
<p>This is <strong>important</strong> information.</p>
```
- `<em>`: Emphasis (rendered as italic text by default)
```html
<p>This word is <em>emphasized</em>.</p>
```
- `<a href="URL">`: Hyperlink
```html
<a href="https://example.com">Visit Example</a>
```
- `<img src="path" alt="description">`: Image
```html
<img src="profile.jpg" alt="Profile Photo" width="200">
```
- `<video controls>` + `<source src="file.mp4" type="video/mp4">`: Embed local video
```html
<video width="640" height="360" controls>
  <source src="my_video.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>
```
- `<iframe src="https://www.youtube.com/embed/ID">`: Embed YouTube video
```html
<iframe width="640" height="360"
        src="https://www.youtube.com/embed/VIDEO_ID"
        frameborder="0"
        allowfullscreen>
</iframe>
```
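
When scraping, the interesting part of these tags is usually an attribute (`href` or `src`) rather than the text. Here is a small sketch, again using the standard-library `html.parser`, that collects those attributes and resolves relative paths against a base URL. The class name `ResourceCollector` and the base URL `https://example.com/page/` are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect link targets and media sources as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []   # href values of <a> tags
        self.media = []   # (tag, src) pairs for <img>, <source>, <iframe>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("img", "source", "iframe") and attrs.get("src"):
            self.media.append((tag, urljoin(self.base_url, attrs["src"])))


sample = """
<a href="https://example.com">Visit Example</a>
<img src="profile.jpg" alt="Profile Photo" width="200">
<video width="640" height="360" controls>
  <source src="my_video.mp4" type="video/mp4">
</video>
"""

collector = ResourceCollector("https://example.com/page/")  # hypothetical base URL
collector.feed(sample)
links = collector.links
media = collector.media
print(links)
print(media)
```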

# 2. Lists, Tables, Forms in HTML

## Lists in HTML (`<ul>`, `<ol>`)

- Lists display a group of related items on a webpage. There are two main list types, and both are built from `<li>` items:

    - `<ul>`: Unordered list (with bullet points)
    - `<ol>`: Ordered list (with numbers)
    - `<li>`: List item

```html
<h2>My Academic Interests</h2>
<ul>
  <li>Statistics</li>
  <li>Data Analysis &amp; Data Visualization</li>
  <li>Sports Big Data</li>
</ul>

<h2>My Hobbies</h2>
<ol>
  <li>Piano</li>
  <li>Soccer</li>
  <li>Original Sound Track</li>
</ol>
```
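
Lists map naturally onto Python lists when scraped. The sketch below is one possible way to do it with the standard-library `html.parser`; it assumes flat (non-nested) lists, and `ListParser` is an illustrative name.

```python
from html.parser import HTMLParser


class ListParser(HTMLParser):
    """Collect the items of each <ul>/<ol> as (list type, [item texts])."""

    def __init__(self):
        super().__init__()
        self.lists = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.lists.append((tag, []))
        elif tag == "li" and self.lists:
            self.lists[-1][1].append("")   # start a new, empty item
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            items = self.lists[-1][1]
            items[-1] += data


sample = """
<ul>
  <li>Statistics</li>
  <li>Data Analysis &amp; Data Visualization</li>
  <li>Sports Big Data</li>
</ul>
<ol>
  <li>Piano</li>
  <li>Soccer</li>
  <li>Original Sound Track</li>
</ol>
"""

parser = ListParser()
parser.feed(sample)
lists = parser.lists
print(lists)
```

The parser decodes the `&amp;` entity, so the second item comes back as plain `Data Analysis & Data Visualization`.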

## Tables in HTML (`<table>`)

Used to display structured data in a grid format.

- `<table>`: Table container
- `<tr>`: Table row
- `<th>`: Table header cell (bold and centered by default)
- `<td>`: Table data cell
```html
<h2>My Profile</h2>
<table border="1">
  <tr>
    <th>Item</th>
    <th>Details</th>
  </tr>
  <tr>
    <td>Name</td>
    <td>Soonwon KWON</td>
  </tr>
  <tr>
    <td>Occupation</td>
    <td>Fourth-year student</td>
  </tr>
</table>
```
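
Tables are a common scraping target because the `<tr>`/`<th>`/`<td>` structure converts directly to rows and columns. A minimal sketch with the standard-library `html.parser` follows; it assumes a simple table with one header row and no merged cells, and `TableParser` is an illustrative name.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table cells into a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None   # text of the cell being read, or None outside cells

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self._cell = ""

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.rows[-1].append(self._cell.strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data


sample = """
<table border="1">
  <tr><th>Item</th><th>Details</th></tr>
  <tr><td>Name</td><td>Soonwon KWON</td></tr>
  <tr><td>Occupation</td><td>Fourth-year student</td></tr>
</table>
"""

parser = TableParser()
parser.feed(sample)
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]   # one dict per data row
print(records)
```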

## Forms in HTML (`<form>`)

Used to collect user input.

- `<form>`: Wraps the input elements
- `<input>`: Input control whose kind is set by `type` (e.g., text, checkbox, password)
- `<textarea>`: Multi-line text input
- `<button>`: Clickable button (submit, etc.)
```html
<h2>Leave a Message</h2>
<form action="/submit" method="POST">
  <label for="name">Name:</label>
  <input type="text" id="name" name="name" required><br><br>

  <label for="message">Message:</label><br>
  <textarea id="message" name="message" rows="4" cols="40"></textarea><br><br>

  <button type="submit">Submit</button>
</form>
```
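In this example:

- `<label for="name">`: Describes an input field; `for` matches the `id` of the field it labels
- `<input type="text" required>`: Single-line text input that must be filled in before submitting; its `name` attribute is the key sent to the server
- `<textarea>`: Multi-line text area
- `<button type="submit">`: Sends the form data to the `action` URL using the given `method`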
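
Knowing how a form is built tells a scraper what request the browser would send. The sketch below prepares (but does not send) the equivalent POST request with Python's standard library. The page URL `https://example.com/contact` and the field values are placeholders chosen for the example; only the field names, `action`, and `method` come from the form above.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import Request

page_url = "https://example.com/contact"   # hypothetical page that contains the form

# Keys are the `name` attributes of the form fields.
fields = {"name": "Alice", "message": "Hello there"}
body = urlencode(fields).encode("utf-8")

# The form's action="/submit" is resolved relative to the page URL.
request = Request(urljoin(page_url, "/submit"), data=body, method="POST")

print(request.get_method(), request.full_url)
print(request.data)
```

Passing `request` to `urllib.request.urlopen()` would actually submit the form, so only do that against a site you are permitted to post to.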

# 3. CSS Basics & Webpage Styling

## What is CSS?
- CSS (Cascading Style Sheets) is a language that controls the **style and layout** of HTML elements.
- While HTML creates the structure, CSS adjusts **colors**, **sizes**, and **layout**.

## Three Ways to Apply CSS

### Inline Style (using `style` attribute)
```html
<p style="color: blue;">This text is blue.</p>
```
- Style applied directly to HTML elements  
- Not recommended (hard to maintain, messy)

### Internal Style (using `<style>` tag)
```html
<head>
  <style>
    p {
      color: blue;
      font-size: 18px;
    }
  </style>
</head>
```
- CSS written inside the HTML document  
- Useful for small projects

### External Style (recommended)
```html
<head>
  <link rel="stylesheet" href="styles.css">
</head>
```
- CSS written in a separate `.css` file  
- Easy to maintain and reuse across pages

## CSS Syntax & Selectors

### Basic Syntax
```css
selector {
  property: value;
}
```
- **Selector**: targets an element  
- **Property**: what you want to change  
- **Value**: how you want it to look

### Common CSS Selectors

| Selector | Description                    | Example                      |
|----------|--------------------------------|------------------------------|
| `*`      | All elements                   | `* { margin: 0; }`           |
| `h1`     | Tag name selector              | `h1 { color: red; }`         |
| `.class` | Class selector                 | `.title { font-size: 20px; }`|
| `#id`    | ID selector                    | `#header { background: black; }` |
| `A, B`   | Multiple selectors             | `h1, p { color: blue; }`     |
| `A B`    | Descendant selector (B inside A) | `div p { color: green; }`  |
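
The same selector syntax is what scraping libraries accept, so it helps to see how the simple forms are evaluated. This is a simplified sketch written for this note, not how any particular library implements selectors; it covers `*`, tag, `.class`, `#id`, and `A, B`, and leaves out the descendant form `A B`. The function names are hypothetical.

```python
def matches(selector, tag, attrs):
    """Return True if a simple selector (*, tag, .class, #id) matches an element."""
    if selector == "*":
        return True
    if selector.startswith("."):
        # An element may carry several space-separated classes.
        return selector[1:] in attrs.get("class", "").split()
    if selector.startswith("#"):
        return attrs.get("id") == selector[1:]
    return selector == tag


def matches_group(selector_group, tag, attrs):
    """Handle the `A, B` form: match if any comma-separated selector matches."""
    return any(matches(part.strip(), tag, attrs) for part in selector_group.split(","))


print(matches("h1", "h1", {}))
print(matches(".title", "p", {"class": "title big"}))
print(matches("#header", "div", {"id": "header"}))
print(matches_group("h1, p", "p", {}))
print(matches(".title", "p", {"class": "subtitle"}))
```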

### Example CSS
```css
/* Apply to whole page */
body {
  font-family: Arial, sans-serif;
  background-color: #f0f0f0;
}

/* Heading style */
h1 {
  color: darkblue;
  text-align: center;
}

/* Paragraph style */
p {
  font-size: 18px;
  color: gray;
}
```

## CSS Box Model

### What is the Box Model?
Every HTML element is treated as a rectangular box with:

- **Content**: text or image inside
- **Padding**: space between content and border
- **Border**: line surrounding the box
- **Margin**: space outside the border (separates from other elements)

### Box Model Example
```css
.box {
  width: 300px;
  padding: 20px;
  border: 2px solid black;
  margin: 10px;
}
```

```html
<div class="box">This is a box model example.</div>
```
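
The arithmetic behind this example is worth spelling out. Assuming the browser default `box-sizing: content-box`, `width` covers only the content, so padding and border are added on both sides. The helper below is a small sketch of that calculation for the horizontal direction; `box_size` is an illustrative name.

```python
def box_size(width, padding, border, margin):
    """Horizontal size of a box under box-sizing: content-box (all values in px)."""
    border_box = width + 2 * padding + 2 * border   # visible box, edge to edge
    footprint = border_box + 2 * margin             # space reserved including margins
    return border_box, footprint


border_box, footprint = box_size(width=300, padding=20, border=2, margin=10)
print(border_box)   # 300 + 40 + 4
print(footprint)    # border box + 20
```

So the `.box` element is drawn 344px wide and occupies 364px of horizontal space once its margins are counted.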

## CSS Layout: `display` & Flexbox

### `display` Property

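| Value    | Description                                                                        |
|----------|------------------------------------------------------------------------------------|
| `block`  | Starts on a new line and takes the full available width (e.g., `<div>`)            |
| `inline` | Flows within a line of text; `width` and `height` have no effect (e.g., `<span>`)  |
| `flex`   | Makes the element a flex container that lays out its children in a row or column   |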

### Flexbox Basic Example
```css
.container {
  display: flex;
  justify-content: space-around;
}
.box {
  width: 100px;
  height: 100px;
  background-color: lightblue;
}
```

```html
<div class="container">
  <div class="box">1</div>
  <div class="box">2</div>
  <div class="box">3</div>
</div>
```
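
To see what `justify-content: space-around` does with these three boxes, the sketch below computes where each box starts. It assumes a container 600px wide (a value chosen for illustration, since the CSS above does not set one), no wrapping, no margins, and left-to-right layout; `space_around_offsets` is a hypothetical helper.

```python
def space_around_offsets(container_width, item_widths):
    """Left offsets (px) of flex items under justify-content: space-around."""
    free = container_width - sum(item_widths)   # leftover space on the line
    gap = free / len(item_widths)               # each item gets an equal share
    offsets = []
    x = gap / 2                                 # half a share before the first item
    for width in item_widths:
        offsets.append(x)
        x += width + gap
    return offsets


offsets = space_around_offsets(600, [100, 100, 100])
print(offsets)
```

Each box gets an equal share of the free space split across its two sides, which is why the gap between neighbors is twice the gap at the container edges.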

# 4. Markdown Basics

## What is Markdown?

- Markdown is a lightweight markup language for writing documents using plain text.
- Simpler to write than HTML, and readable even as raw text.
- Widely used in platforms like GitHub, Jupyter Notebook, and RMarkdown.
- In R, Markdown is used in `.Rmd` files to combine documentation and executable code.

## Basic Markdown Syntax

### Headers

```markdown
# Header 1  
## Header 2  
### Header 3
```

### Emphasis

```markdown
*Italic* or _Italic_  
**Bold** or __Bold__  
~~Strikethrough~~
```

### Lists

- **Unordered List:**
```markdown
- Item 1  
  - Sub-item 1.1  
  - Sub-item 1.2  
- Item 2
```

- **Ordered List:**
```markdown
1. First item  
2. Second item  
3. Third item
```

### Links

```markdown
[CRAN R Official Site](https://cran.r-project.org/)
```

[CRAN R Official Site](https://cran.r-project.org/)

### Images

```markdown
![R Logo](https://www.r-project.org/logo/Rlogo.png)
```

![R Logo](https://www.r-project.org/logo/Rlogo.png)
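
When Markdown is the thing being scraped, links and images can be pulled out with regular expressions. The patterns below are a simplified sketch: they handle the inline forms shown above but not reference-style links or link titles, and the names `LINK` and `IMAGE` are illustrative.

```python
import re

# A link is [text](url) NOT preceded by "!"; an image is ![alt](url).
LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")
IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")

text = """
[CRAN R Official Site](https://cran.r-project.org/)
![R Logo](https://www.r-project.org/logo/Rlogo.png)
"""

links = LINK.findall(text)     # list of (text, url) pairs
images = IMAGE.findall(text)   # list of (alt text, url) pairs
print(links)
print(images)
```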

# 5. Introduction to RMarkdown

## What is RMarkdown?

- RMarkdown (`.Rmd`) is a document format that combines R code and narrative text.
- Allows you to create documents with analysis results and interpretation together.
- Common output formats: **HTML, PDF, Word** (others, such as slide decks, are also available).
- Structure: **YAML Header + Markdown Text + R Code Chunks**

## RMarkdown Document Structure

### 1. YAML Header

```yaml
---
title: "RMarkdown Example"
author: "Soonwon KWON"
date: "`r Sys.Date()`"
output: html_document
---
```

- Defines document title, author, date, and output format.
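
If you collect `.Rmd` or Markdown files while scraping, the header can be separated from the body with a few lines of code. The sketch below only understands flat `key: value` lines and is not a YAML parser; for anything more complex, a real YAML library is the safer choice. The function name `split_front_matter` is hypothetical.

```python
def split_front_matter(text):
    """Split a document into (metadata dict, body); handles flat `key: value` lines only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    # Find the closing "---" delimiter after the opening one.
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip('"')
    return meta, "\n".join(lines[end + 1:])


document = "\n".join([
    "---",
    'title: "RMarkdown Example"',
    'author: "Soonwon KWON"',
    "output: html_document",
    "---",
    "",
    "## Data Analysis Results",
])

meta, body = split_front_matter(document)
print(meta)
print(body.strip())
```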

### 2. Markdown Text + R Code Chunk

```markdown
## Data Analysis Results

Here we compute some basic summary statistics.
```

```{r}
summary(cars)
```

- **R code chunks (` ```{r} `)** allow you to run R code inside the document.
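
Because chunks follow a fixed pattern, they are easy to extract from an `.Rmd` file, for example to collect the R code from documents you have downloaded. The following is a minimal sketch using a regular expression; the second chunk with the label `plot-cars` is added for illustration, and the pattern assumes chunks that start at the beginning of a line.

```python
import re

FENCE = "`" * 3   # built at runtime so this example can itself sit inside a fenced block

# Opening fence with {r ...} options, the chunk body, then a closing fence on its own line.
CHUNK = re.compile(
    "^" + FENCE + r"\{r([^}]*)\}[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

rmd = "\n".join([
    "## Data Analysis Results",
    "",
    "Here we compute some basic summary statistics.",
    "",
    FENCE + "{r}",
    "summary(cars)",
    FENCE,
    "",
    FENCE + "{r plot-cars, echo=FALSE}",
    "plot(cars)",
    FENCE,
])

# Each result is (chunk options, R code).
chunks = [(options.strip(), code.strip()) for options, code in CHUNK.findall(rmd)]
print(chunks)
```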

### 3. Output Formats: HTML, PDF, Word

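- After writing your `.Rmd` file, click the **Knit** button in RStudio, or call `rmarkdown::render()` on the file from the R console, to render the document.
- Requires the `rmarkdown` package; PDF output also needs a LaTeX installation (e.g., TinyTeX).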