### **Git and GitHub**

**What are Git and GitHub all about? (Answer provided by ChatGPT 4o mini)**

**Git** is a distributed version-control system designed to track changes in any set of files—most often source code—enabling multiple people to work on a project without overwriting each other’s work. Every developer’s copy of a Git “repository” contains the full project history, so you can commit new changes, review past snapshots, branch to experiment, merge work from others and even roll back mistakes—all offline and very efficiently.

**GitHub** is a cloud-based hosting service for Git repositories with a web-based graphical interface. Beyond simply storing code, GitHub adds collaboration tools—pull requests for proposing and reviewing changes, issue trackers for reporting bugs or planning features, wikis and project boards for documentation and workflow, and automated actions for CI/CD. It integrates social features (stars, follows), access controls, and community-driven open-source project discovery.

In practice, you “clone” a GitHub repo to your local machine, make commits in Git, then “push” your updates back to GitHub where teammates can see, comment on or merge them. This combination of Git’s powerful history and branching model with GitHub’s collaborative ecosystem has revolutionized modern software development.
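
The clone, commit, push cycle described above can be sketched as a list of Git commands. The helper name `git_workflow_commands`, the placeholder URL and the default branch `main` are assumptions made for illustration; the function only builds the argument lists and does not execute anything.

```python
def git_workflow_commands(repo_url, message, branch="main"):
    """Return the git commands for a basic clone -> commit -> push cycle.

    Hypothetical helper: it only builds argument lists, nothing is executed.
    """
    return [
        ["git", "clone", repo_url],         # copy the remote repository locally
        ["git", "add", "."],                # stage all changes in the working tree
        ["git", "commit", "-m", message],   # record the staged changes as a commit
        ["git", "push", "origin", branch],  # upload the new commit to the remote
    ]


commands = git_workflow_commands(
    "https://github.com/example-user/example-repo.git",  # placeholder URL
    "Fix typo in README",
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)`, with the add, commit and push steps run inside the cloned directory (for example via the `cwd` argument).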


*   **Intro to Git and GitHub:**
    *   https://youtu.be/DVRQoVRzMIY?si=hzzvPar1rmIldReS (Tech With Tim)
    *   https://youtu.be/tRZGeaHPoaw?si=2NAuP-mnlHaKnJfL (Kevin Stratvert)

*   **Create a GitHub account:** https://youtu.be/Gn3w1UvTx0A?si=Z1JAftfOxKjj1QK3


*   **Generate a GitHub personal access token:** https://youtu.be/iLrywUfs7yU?si=cXnHNaRItqeWIClF







#### **connect from Colab to GitHub**

The token and the Git email must be stored as Colab secrets named `gittoken` and `gitemail`.

In [None]:
from google.colab import userdata
import os

# Read the token and email from the Colab secrets;
# they must be stored there under the names 'gittoken' and 'gitemail'
token = userdata.get('gittoken')
gitemail = userdata.get('gitemail')

!git config --global user.name "rps007"
!git config --global user.email "{gitemail}"

repo = ""  # fill in the repository name
username = "rps007"

# Clone over HTTPS, authenticating with the token
!git clone https://{username}:{token}@github.com/{username}/{repo}.git

# Change the working directory to the cloned repository
folder = f"/content/{repo}"
os.chdir(folder)

print(f"\nCurrent working directory set to: {os.getcwd()}")

Cloning into 'kopptools'...
remote: Enumerating objects: 368, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 368 (delta 19), reused 13 (delta 5), pack-reused 333 (from 2)
Receiving objects: 100% (368/368), 58.23 MiB | 13.22 MiB/s, done.
Resolving deltas: 100% (216/216), done.
Updating files: 100% (63/63), done.

Current working directory set to: /content/kopptools
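
The clone command above embeds the token in the URL, and Git stores that remote URL in the repository's `.git/config`, so the token becomes visible wherever the URL is printed or logged. The following is a minimal sketch of building such a URL and redacting the token before printing it. The helper names `build_clone_url` and `redact_token` and all sample values are hypothetical.

```python
from urllib.parse import quote


def build_clone_url(username, token, repo):
    """Build an HTTPS clone URL that embeds the token (hypothetical helper)."""
    # quote() percent-encodes characters that are not safe in the credentials part
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@github.com/{username}/{repo}.git"


def redact_token(url, token):
    """Replace the (encoded) token in a URL with asterisks so it can be printed."""
    if not token:
        return url
    return url.replace(quote(token, safe=""), "***")


# Dummy values for illustration only, not real credentials
url = build_clone_url("example-user", "dummy-token-123", "example-repo")
print(redact_token(url, "dummy-token-123"))
# prints: https://example-user:***@github.com/example-user/example-repo.git
```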


#### **commit changes and push to GitHub**

In [None]:
!git add .
!git commit -m "Your commit message"
!git push origin main  # or your branch name

On branch main
Your branch is up to date with 'origin/main'.

nothing to commit, working tree clean
Everything up-to-date


#### **install libraries to read PDFs**

In [None]:
!pip install pymupdf
!pip install icecream
!pip install tqdm

Collecting pymupdf
  Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.5-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.25.5


#### **read PDF and convert it to a txt file**

In [None]:
import pymupdf

file = 'rs19831215_1bvr020983.pdf'

doc = pymupdf.open(file)  # open the document
with open("output.txt", "wb") as out:  # create the text output file
    for page in doc:  # iterate over the document pages
        text = page.get_text().encode("utf8")  # plain text of the page, encoded as UTF-8
        out.write(text)  # write the text of the page
        out.write(bytes((12,)))  # write the page delimiter (form feed, 0x0C)
        print(text.decode("utf8"))  # echo the page text
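
Because every page in the output file ends with a form feed, the file can later be split back into pages. The sketch below assumes a file written in that layout; the helper name `split_pages` and the small stand-in file are made up for the demo and do not come from the PDF above.

```python
import tempfile
from pathlib import Path


def split_pages(text_file):
    """Return the pages of a text file in which each page ends with a form feed."""
    content = Path(text_file).read_text(encoding="utf8")
    pages = content.split("\f")  # "\f" is the form feed character (0x0C)
    if pages and pages[-1] == "":
        pages.pop()  # drop the empty piece after the final delimiter
    return pages


# Demo with a stand-in file that mimics the layout written by the cell above
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_output.txt"
    demo.write_bytes(
        "first page\n".encode("utf8") + bytes((12,))
        + "second page\n".encode("utf8") + bytes((12,))
    )
    pages = split_pages(demo)

print(len(pages))  # prints: 2
```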


#### **draw rectangle**

In [None]:
import fitz  # PyMuPDF

def draw_rectangle(doc, page_num, x, y, width, height):
    """Draw a red rectangle outline on page `page_num` (0-based) of `doc`."""
    page = doc[page_num]
    rect = fitz.Rect(x, y, x + width, y + height)

    page.draw_rect(
        rect,
        color=(1, 0, 0),      # red outline
        width=2,              # line width of 2 pt
        fill=None,            # no fill
        overlay=True          # draw on top of the existing content
    )

# open the PDF
file = 'rs19831215_1bvr020983.pdf'
doc = fitz.open(file)

# define the rectangle by its top-left x, y coordinates, width and height (in points)
values = [50, 47, 60, 20]

x, y, width, height = values
# page_num=1 is the second page; the change stays in memory until the document is saved
draw_rectangle(doc, page_num=1, x=x, y=y, width=width, height=height)

#### **read text from a rectangle into a pandas DataFrame**

In [None]:
import pandas as pd
import fitz
from tqdm import tqdm

def read_words_from_rectangle(page_num, rect_list, doc):
    """Return one DataFrame row per word inside the rectangle on a page."""
    page = doc[page_num]
    x, y, w, h = rect_list
    clip = fitz.Rect(x, y, x + w, y + h)

    words = page.get_text("words", clip=clip)
    dd = page.get_text("dict", clip=clip)

    # collect font name, font size and bounding box of every text span
    spans = []
    for block in dd["blocks"]:
        for line in block.get("lines", []):  # image blocks have no "lines" key
            for span in line["spans"]:
                spans.append({
                    "font": span["font"],
                    "size": span["size"],
                    "bbox": fitz.Rect(span["bbox"])
                })

    records = []
    for x0, y0, x1, y1, word, blk, ln, wn in words:
        # assign the font of the first span that contains the word's center
        center = fitz.Point((x0 + x1) / 2, (y0 + y1) / 2)
        font, size = None, None
        for sp in spans:
            if sp["bbox"].contains(center):
                font, size = sp["font"], sp["size"]
                break
        records.append({
            "word": word,
            "x0": x0, "y0": y0, "x1": x1, "y1": y1,
            "block": blk, "page": page_num,
            "font": font, "size": size
        })

    return pd.DataFrame(records)

def pdf2df(file, rect=(40, 70, 335, 470)):
    """Read the words inside `rect` (x, y, width, height) from every page of the PDF."""
    doc = fitz.open(file)
    df_all = []
    for pg in tqdm(range(len(doc)), desc="Processing pages", unit="pg"):
        df_all.append(read_words_from_rectangle(pg, rect, doc))
    df = pd.concat(df_all, ignore_index=True)
    # sort into reading order: page, then top to bottom, then left to right
    return df.sort_values(['page','y1','x0']).reset_index(drop=True)

file = 'rs19831215_1bvr020983.pdf'
df = pdf2df(file)
df
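
The DataFrame holds one row per word. A possible next step is to rebuild text lines from those rows by grouping words with a similar bottom coordinate. The sketch below works on plain dictionaries of the kind `df.to_dict("records")` would produce; the helper name `words_to_lines`, the tolerance `y_tol` and the sample records are assumptions for illustration, not data from the PDF.

```python
def words_to_lines(records, y_tol=2.0):
    """Group word records into text lines by similar bottom coordinate (y1).

    Hypothetical helper: `records` is a list of dicts with the keys
    "page", "x0", "y1" and "word"; `y_tol` is a tolerance in points.
    """
    ordered = sorted(records, key=lambda r: (r["page"], r["y1"], r["x0"]))
    lines = []  # each entry is the list of word records on one line
    for rec in ordered:
        last = lines[-1] if lines else None
        same_line = (
            last is not None
            and last[0]["page"] == rec["page"]
            and abs(last[0]["y1"] - rec["y1"]) <= y_tol
        )
        if same_line:
            last.append(rec)
        else:
            lines.append([rec])
    # within each line, order the words from left to right
    return [
        " ".join(r["word"] for r in sorted(line, key=lambda r: r["x0"]))
        for line in lines
    ]


# Made-up sample records
sample = [
    {"page": 0, "x0": 90.0, "y1": 100.4, "word": "world"},
    {"page": 0, "x0": 50.0, "y1": 100.0, "word": "Hello"},
    {"page": 0, "x0": 50.0, "y1": 115.0, "word": "Next"},
    {"page": 1, "x0": 50.0, "y1": 100.0, "word": "Page2"},
]
print(words_to_lines(sample))  # prints: ['Hello world', 'Next', 'Page2']
```

Comparing each word with the first word of the current line keeps the grouping simple; a layout with strongly varying font sizes would likely need a different tolerance.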
