# Software Introduction

####

### 1. Anaconda
- An all-inclusive resort for running software: Jupyter Notebook, Python, and others.
- Each piece of software runs inside its own local environment, which prevents conflicts with the software installed on your physical machine.
- The default local environment is "base", but you can create a new environment for each project or pipeline.
    
> *An environment is a virtual computer with (in our case) its own Python installation, python libraries/packages, and specific library/package versions, which can be completely different from what is installed on the physical machine or other environments.*
>
>> **When linked to a python pipeline**
>> - Guarantees successful runs across time.
>>   - Runs under the same conditions in which it was ***successfully*** built.
>>   - Prevents breaks caused by updates to python packages in the future. 
>> - Guarantees successful runs across separate physical machines.
>>   - Gives the user the software recipe to build the correct environment, instead of using their machine's native environment.
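
To make the idea of an environment concrete, the sketch below asks the running Python interpreter where it lives. It uses only the standard library; `describe_environment` is a hypothetical helper written for illustration, not something Anaconda provides.

```python
import sys


def describe_environment() -> dict:
    """Collect a few facts about the Python environment running this code."""
    return {
        # Full path of the interpreter that is executing this script.
        "executable": sys.executable,
        # Root folder of the active environment.
        "prefix": sys.prefix,
        # Python version of this environment as "major.minor.patch".
        "python_version": ".".join(str(n) for n in sys.version_info[:3]),
        # True inside a venv, where the prefix differs from the base install.
        # (A conda environment is its own installation, so it reports False.)
        "in_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this in two different environments should print two different `prefix` paths, which is the separation described above.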

####

### 2. Jupyter Notebook
- Included in the Anaconda distribution.
- Useful when you want to run python interactively: explore the data, generate plots, run and test python scripts as one builds them. 
- To run Python non-interactively, export the notebook as an **Executable Script**. This creates a Python **.py** file that is used to either:
> - store Python functions that collectively build a package (packages let you easily call those functions inside other Python scripts), or
> - execute the pipeline: read data, process data into results, output results.
>> ***Note:*** *jupyter notebooks are ***not*** used to run the pipeline or build the pipeline's associated python package.*
- Can run under any Python environment (or "kernel") created within Anaconda, i.e. it can emulate the true pipeline environment.
    
####

### 3. Visual Studio Code
- A text editor with python syntax highlighting.
- It can run python code but I use it to easily edit python scripts (.py), rather than run code. 

####

### 4. GitHub
- **GitHub** is a cloud service that stores repositories of code, allowing for simultaneous collaboration. 
> ***It's Box but for code, not data!***
- Cloud repositories are downloaded, or ***cloned***, onto your physical machine. 
> - make changes locally
> - **push** those changes to the cloud repository

####

### 5. Git

- **Git** is version control software through which your local repository and the cloud repository communicate.
- It comes with a suite of commands that give users control over that relationship.
- The emphasis is on **CONTROL**. 
> - User control can become very complex when there are several collaborators making changes daily and simultaneously.     
>> **For our purposes, we will avoid that complexity and only need to learn a small number of commands, simple version-control concepts, and basic best practices.**

**Main Git commands run locally in Git Bash**
> - `git init` — initialize a local folder as a Git repository
> - `git clone <repository url>` — copy a repo in the cloud to your local computer
> - `git checkout -b <name>` — create a local "branch" on which to make changes to the repo, and switch to it
> - `git status` — show which files have changed, which changes are staged for the next commit, and which files are untracked
> - `git add <file>` or `git add .` — stage one file, or all changed and untracked files in the current folder ("."), for the next commit
> - `git commit -m "<description>"` — package the staged changes under a single purpose, preparing them to be pushed to the cloud
> - `git rm <file>` — delete a file and stage its removal so the repo stops tracking it (use `git rm --cached <file>` to stop tracking it but keep the local file)
> - `git push -u origin main` or `git push -u origin <branch>` — push your local commits to the main branch or to another branch of the cloud repo
> - `git config --global user.name "<your name>"` — set the name recorded on your commits
> - `git config --global user.email "<your email>"` — set the email recorded on your commits; use your GitHub account email so commits are attributed to you

####

### 6. Git Bash

- **Git Bash** is a command-line terminal for Windows that lets you run Git commands and, as a bonus, many Linux-style system commands inside a Bash shell.
- Many Python tools and tutorials assume a Unix/Linux-style shell, which Git Bash provides on Windows.
- Most servers run Linux with Bash, so most software and data engineers are more fluent in Bash than in the Windows command line.

**Common Linux-Bash system commands**

> - `cd <path/to/folder>` — navigate to a folder
> - `pwd` — print the directory you are in
> - `mkdir <name>` — create a folder  
> - `ls` — list files in the current directory  
> - `cp <file> <path>` — copy a file to a new location  
> - `rm <file>` — delete a file  
> - `touch <file.ext>` — create an empty file with the given extension


# _________________________________________________________________________________
# _________________________________________________________________________________
####


# Initial Setup — Only Run Once
####

## Sign up for a Github account with your UD email

- Go to `https://github.com/signup`
- <mark>Warning</mark>: Change the username assigned to you to your UD username. 
- You now can be added as a contributor to repos.

####

## Connect to the *ir-team-exercise* Repository

####

## STEP 1 — Open Git Bash
>
>####
>
## STEP 2 — Create local directories
>
>> - `mkdir -p C:/Users/<username>/ir/ir-pipelines`
>>   
>> - `cd C:/Users/<username>/ir/ir-pipelines`
>>
>> - `pwd`
>
>####
>
## STEP 3 — Update Git credentials
>
>>
>> - `git config --global user.name "first last"`
>> - `git config --global user.email "udayton email"`
>
>####
>
## STEP 4 — Clone the Repository
>
>> - `git clone https://github.com/sruddy1/ir-team-exercise.git`
>
>####
>
## STEP 5 — Create Your Personal Branch
>
>> - `cd ir-team-exercise`
>>
>> - `pwd`
>> 
>> - `git checkout -b team-exercise/<your-first-name>`
>>
>>   - Creates a ***branch*** on your local machine, which is essentially your own working copy of the repo. It appears on GitHub the first time you push it.
>>   - Separates your changes from the main line of the repo, referred to as "main" or "master".
>>   - Allows changes to be made and pushed to GitHub without updating "main".
>>   - Prevents conflicts between users who are using separate branches.
>
>####
>
## STEP 6 — Authenticate Git Account
>
>> - `git credential-manager github login`
>>
>> - An authentication window will pop up. Go through the steps to authenticate.
>>
>> - Now your local machine is authenticated and linked to your github account.
>
>####
>

####

## Set up GitBash with Anaconda Commands and Python

####

## STEP 1 — Open Visual Studio Code
>
> - Click the Extensions icon (the 5th icon down on the left pane)
> - Click the install button for 3 Python extensions: Pylance, Python, and Python Debugger
>
>####
>
## STEP 2 — Turn on View Hidden Files
>
>> - Open `File Explorer` > `View` > `Show` > Select `Hidden Items`
>
>####
>
## STEP 3 — Create ~/.bashrc
>
>> - Open Git Bash
>> - (type) `cd ~`
>> - (type) `touch ~/.bashrc`
>
>####
>
## STEP 4 — Update ~/.bashrc
>
>> - Open VSC
>>
>> - Go to `File` > `Open File` > Open `C:\Users\<username>\.bashrc`
>>
>> - Confirm/Find path to conda.sh based on installation type:
>>
>>   - All Users Installation: `/c/ProgramData/anaconda3/etc/profile.d/conda.sh`
>>   - Single User Installation: `/c/Users/<username>/AppData/Local/anaconda3/etc/profile.d/conda.sh`
> 
>> - Confirm/Find path to anaconda3 folder based on installation type:
>>
>>   - All Users Installation: `/c/ProgramData/anaconda3`
>>   - Single User Installation: `/c/Users/<username>/AppData/Local/anaconda3`
>
>> - Add the following command, replacing the path with the conda.sh path you confirmed above
>>
>>   ```
>>   if [ -f "/c/Users/<username>/AppData/Local/anaconda3/etc/profile.d/conda.sh" ]; then
>>     . "/c/Users/<username>/AppData/Local/anaconda3/etc/profile.d/conda.sh"
>>   fi
>>   ```
> 
>> - Add the following command
>>
>>   `export PATH="/c/Users/<username>/AppData/Local/anaconda3:$PATH"`
>>
>
>> - Save & Close File
>
>> - Open GitBash (if already open, close and reopen)
>>
>>   - `source ~/.bashrc`
>>
>>   - `conda init bash`
>>  
>>   - Close GitBash 




# _________________________________________________________________________________
# _________________________________________________________________________________
####

# Run an Existing Pipeline Stored on GitHub


####

## Update Configuration File

####

>#### STEP 0 — Clone the Repository
>
>> - Open GitBash
>>
>> - `cd .../ir/ir-pipelines`
>> 
>> - if you haven't cloned the repo yet: `git clone https://github.com/sruddy1/ir-team-exercise.git`
>>
>> - `cd ir-team-exercise`
>>
>> - `git checkout -b team-exercise/<your-first-name>`
>>
>>   - Creates a ***branch*** on your local machine, which is essentially your own working copy of the repo. It appears on GitHub the first time you push it.
>>   - Separates your changes from the main line of the repo, referred to as "main" or "master".
>>   - Allows changes to be made and pushed to GitHub without updating "main".
>>   - Prevents conflicts between users who are using separate branches.

>#### STEP 1 — Open Visual Studio Code
>
>####
>
>#### STEP 2 — Open the Pipeline Configuration File in VSC
>
>> - Click `File` > `Open File`
>>   
>> - Navigate to `C:\Users\<username>\ir\ir-pipelines\ir-team-exercise\configs\config.yaml`
>
>####
>
>#### STEP 3 — Change file paths to match your local machine
>
>> - **name**: 'First name'
>> - **box_root**: C:/Users/`<username>`/Box
>
>####
>
>#### STEP 4 — Save & Close
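
Before running the pipeline, it can help to confirm that the folders named in the config file actually exist on your machine. The sketch below is a minimal, standard-library illustration; the dictionary stands in for path values read from `config.yaml`, and `missing_directories` is a hypothetical helper, not part of the pipeline.

```python
from pathlib import Path


def missing_directories(paths: dict) -> list:
    """Return the config keys whose folders do not exist on this machine."""
    return [key for key, folder in paths.items() if not Path(folder).is_dir()]


if __name__ == "__main__":
    # Stand-in for the path entries in config.yaml; replace <username> with yours.
    config_paths = {"box_root": "C:/Users/<username>/Box"}
    problems = missing_directories(config_paths)
    if problems:
        print("Fix these config entries:", ", ".join(problems))
    else:
        print("All configured folders exist.")
```

An empty result means every configured folder was found; any key that is returned points at a path that still needs to be corrected or created.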

####

## Build & Activate the Python Virtual Environment

####

>#### STEP 1 — Open Git Bash
>
>####
>
>#### STEP 2 — Deactivate Existing Environment
>
>> - `deactivate`
>>   - A `deactivate: command not found` error is good! It means no virtual environment was active.
>>   - If the error does not appear, that's also good: it means you had an environment active that is now deactivated.
>
>####
>
>#### STEP 3 — Build the Virtual Environment
>
>> - `cd C:/Users/<username>/ir/ir-pipelines/ir-team-exercise`
>>   
>> - `python -m venv .venv`
>
>#### 
>
>#### STEP 4 — Activate the Virtual Environment
>
>> - `source .venv/Scripts/activate`
>>
>>   - This switches `python` and `pip` to the copies inside `.venv`, so packages are installed into, and the pipeline runs from, its own environment.
>

####

## Install Packages into the Virtual Environment

####

>#### STEP 1 — Install External Python Packages into the Virtual Environment
>
>> - `pip install -r requirements.txt`
>>
>>   - `requirements.txt` contains a list of packages along with their version numbers used to build the pipeline.
>
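
To see what pinning versions in `requirements.txt` provides, the sketch below compares pinned `name==version` lines against what is installed in the active environment. It is a simplified illustration built on the standard library's `importlib.metadata`; `check_pins` is a hypothetical helper, and it only understands exact `==` pins, not the full requirements syntax.

```python
from importlib.metadata import PackageNotFoundError, version
from pathlib import Path


def check_pins(requirement_lines):
    """Return {package: (pinned, installed)} for every pin that does not match."""
    mismatches = {}
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, pinned = (part.strip() for part in line.split("==", 1))
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    req_file = Path("requirements.txt")
    # Read the file only if it exists, so the sketch is safe to run anywhere.
    lines = req_file.read_text(encoding="utf-8").splitlines() if req_file.exists() else []
    for name, (pinned, installed) in check_pins(lines).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

If nothing is printed after `pip install -r requirements.txt`, every exact pin in the file matches the active environment.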
>####
>
>#### STEP 2 — Install the Pipeline Python Package into the Virtual Environment
>
>> - `pip install -e .`
>>
>> - This installs all the `*.py` files contained in `./src/ir_team_exercise` as a python package inside the virtual environment
>>
>> - Allows you to access the functions inside python scripts, e.g. if you want to use the function 'validate_columns' contained inside `./src/ir_team_exercise/checks.py` then you would add this to your python script:
>>
>>   `from ir_team_exercise.checks import validate_columns`
>
>####
>
>#### STEP 3 — Make the virtual environment selectable by Jupyter Notebook
>
>> - `python -m ipykernel install --user --name=myenv --display-name "ir-team-exercise"`
>
>####


####

## Run the Pipeline

####

>
> - `python run.py`
>
> - `deactivate`
>


####

## Check Results Folder

####

> - Use File Explorer to navigate to `C:\Users\<username>\Box\Inst Res Collab\Team Retreat Pipeline Results\<First Name>` and confirm the pipeline successfully output the results file.

####

## Push Branch to Github

####

> - `git status` : see what files have been changed locally
>   
> - `git add .` : add all changed files to the commit
>> - Ignore `LF will be replaced by CRLF...` warning.
>
> - `git commit -m "Successfully ran pipeline"` : collect changes into a commit
>   
> - `git push -u origin team-exercise/<name>` : push the commit to your branch on GitHub (this does not change the main branch)

####

## Run Branch in the Future

####

> - Open Git Bash
> - `cd C:/Users/<username>/ir/ir-pipelines/ir-team-exercise` : Navigate to the repo on your local machine.
> - `git checkout team-exercise/<name>` : switch to your existing branch if the analysis has not been updated; otherwise, create a new branch from main as done previously
> - Then,
>> - Update Config File if needed.
>> - Build environment (only if you haven't previously)
>> - Activate environment
>> - Install External Packages (only if requirements have changed since last build)
>> - Install Pipeline Package (only if changes have been made to the src/ .py files)
>> - Run Pipeline
>> - Deactivate Environment
>> - git add > commit > push changes (if you want to keep a record of any changes to tracked files)

####

## All Commands

####

> - Open Git Bash
>   
> - if repo not currently on local machine
>>   - `git clone https://github.com/<github-account-name>/<name-of-repo>.git`
>     
> - `cd <path-to-repo>/<name-of-repo>`
>
> - `git checkout main` : switch back to main first, in case your local repo is currently on another branch, so the new branch starts from main.
>
> - `git checkout -b <branch-folder>/<branch-name>`
>
> - Update config file using VSC located here: `<name-of-repo>/configs/config.yaml`
>
>> - Make sure all the directories that are referenced in the config file exist.
>
> - `python -m venv .venv` : only run if you haven't already created the .venv folder
>
> - `source .venv/Scripts/activate` 
>
> - `pip install -r requirements.txt` : run only if new requirements were added since the last time you built the virtual env
>
> - `pip install -e .` : run every time you make changes to the `*.py` files in src
>
> - `python run.py`
>
> - `deactivate`
>
> - `git add .`
>
> - `git commit -m "Type informative message"`
>
> - `git push -u origin <branch-folder>/<branch-name>`
>
####


# Create a Python Pipeline

####

### Folder Structure


# _________________________________________________________________________________
# _________________________________________________________________________________
####

# Convert Python Pipeline to a Git Repository

####


# _________________________________________________________________________________
# _________________________________________________________________________________
####

# DocStrings

```
    """
    # <Purpose of function>

        <optional description>

    Parameters
    ----------
    > <param_1> : <data_type>
        
        <description of param_1> <optional example>
    
    > <param_2> : <data_type>
        
    >> default : <param_2_default_value>

        <description of param_2> <optional example>

    Returns
    -------
    > <data_type>
        
        <description of what is returned> 

    Example
    -------
    > <param_1> = <param_1_value>

    > <param_2> = <param_2_value>

        return 
        
            <example_function_output>

            <optional description>
 
    #
    """

```
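
As an illustration, here is the template filled in for a small function. The function `percent_change` and its parameters are made up for this example; only the docstring layout follows the template above.

```python
def percent_change(old: float, new: float, digits: int = 1) -> float:
    """
    # Compute the percent change between two values

        Positive results mean growth, negative results mean decline.

    Parameters
    ----------
    > old : float

        Starting value; must not be zero. Example: 200

    > new : float

        Ending value. Example: 250

    > digits : int

    >> default : 1

        Number of decimal places to round the result to.

    Returns
    -------
    > float

        Percent change from old to new, rounded to `digits` places.

    Example
    -------
    > old = 200

    > new = 250

        return

            25.0

    #
    """
    if old == 0:
        raise ValueError("old must not be zero")
    return round((new - old) / old * 100, digits)
```

Calling `help(percent_change)` in a notebook prints the docstring exactly as written, which is a quick way to check the layout before the documentation is built.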

# _________________________________________________________________________________
# _________________________________________________________________________________
####

# Adding New Submodule w/ Documentation to ir-docs-portal Repo

>#### Step 1: Add submodule <reponame\> to documentation repo (ir-docs-portal)
> - Open GitBash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `git submodule add https://github.com/sruddy1/<reponame>.git ir-pipelines/<reponame>`
> - `git submodule update --init`
> - `git add .gitmodules ir-pipelines/<reponame>`
> - `git commit -m "added <reponame>"`
> - `git push`
>
>#### Step 2: Add new repo documentation to mkdocs flow
> - Open VSC
> - File -> Open File
> - Navigate to and open `ir-docs-portal/scripts/gen_portal_ref_pages.py`
> - Find python variable `PIPELINE_LABELS`
> - Add another row to this variable in the same format as the other rows: "<reponame\>": "<pretty label\>"
>> Example
>> 
>> `"ir-enrollment-projection": "Enrollment Projection"`
>> 
>> <mark>Warning: You need to add a ',' to the line above the new line you have added.</mark>
>>
> - Save and close file
> - File -> Open File
> - Navigate to and open `ir-docs-portal/mkdocs.yml`
> - Locate plugins -> mkdocstrings -> handlers -> python -> paths
> - Add a new row in the same format as the other rows: - ir-pipelines/<reponame\>/src
> - Save & close file
>
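
For orientation, the edit to `PIPELINE_LABELS` might look like the sketch below. It assumes the variable is a plain Python dictionary, which the `"<reponame>": "<pretty label>"` row format suggests but which should be confirmed in the script itself; `ir-new-pipeline` and its label are made-up placeholders.

```python
# Hypothetical contents; the real dictionary in gen_portal_ref_pages.py will differ.
PIPELINE_LABELS = {
    "ir-enrollment-projection": "Enrollment Projection",  # existing row, now ends with ","
    "ir-new-pipeline": "New Pipeline",  # new row: "<reponame>": "<pretty label>"
}

# Look up the label for the newly added repo.
print(PIPELINE_LABELS["ir-new-pipeline"])
```

Without the comma after the existing row, Python raises a `SyntaxError` when the script runs.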
>#### Step 3: Add, Commit, Push changes to ir-docs-portal Repo
>
> - Open up Git Bash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `git add .`
> - `git commit -m "added <reponame>"`
> - `git push`
>
>#### Step 4: Update Github
>
> - Open up Gitbash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `mkdocs gh-deploy --force`
> - View documentation here: `https://sruddy1.github.io/ir-docs-portal/`
>><mark>Note: May take several minutes for update to take effect</mark>

# _________________________________________________________________________________
# _________________________________________________________________________________
####

# Updating Docs for Existing Submodules

>#### Step 1: Update submodules inside ir-docs-portal Repo
>
> - Open Gitbash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `git submodule update --remote --merge`
>
>#### Step 2: Add, Commit, Push changes to ir-docs-portal Repo
>
> - Stay in Gitbash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `git add ir-pipelines`
> - `git commit -m "update submodules"`
> - `git push`
>   
>#### Step 3: Update Github
>
> - Open up Gitbash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `mkdocs gh-deploy --force`
> - View documentation here: `https://sruddy1.github.io/ir-docs-portal/`
>><mark>Note: May take several minutes for update to take effect</mark>

# _________________________________________________________________________________
# _________________________________________________________________________________
####

# Adding New Markdown File to ir-docs-portal Sidebar

## Step 1: Create blank .md file
>
> - Open up Gitbash
> - `cd ~/ir/ir-pipelines/ir-docs-portal/docs`
> - `touch <filename>.md`
>
>####
>
## Step 2: Edit .md file
>
> - Open up VSC
> - File -> Open File
> - Navigate to and open `~/ir/ir-pipelines/ir-docs-portal/docs/<filename>.md`
> - Write or copy-paste your markdown text
> - Save & Close file
>
>####
>
## Step 3: Update mkdocs.yml
>
> - Stay in VSC
> - File -> Open File
> - Navigate to and open `~/ir/ir-pipelines/ir-docs-portal/mkdocs.yml`
> - Locate `nav:`
> - Add a new row in the same format as the other lines:
>   
>   `- <pretty label>: <filename>.md`
>
> - Save & Close file
>
>####
>
## Step 4: Add, Commit, Push changes to ir-docs-portal Repo
>
> - Open Gitbash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `git add .`
> - `git commit -m "add <filename>.md to docs"`
> - `git push`
>
>####
>   
## Step 5: Update Github
>
> - Stay in Gitbash
> - `cd ~/ir/ir-pipelines/ir-docs-portal`
> - `mkdocs gh-deploy --force`
> - View documentation here: `https://sruddy1.github.io/ir-docs-portal/`
>><mark>Note: May take several minutes for update to take effect</mark>


# _________________________________________________________________________________
# _________________________________________________________________________________
####

# Connecting to ODS via Python

## Step 1: Install python packages
>
> - Open up GitBash
> - Run:
> ```bash
> conda activate
> pip install sqlalchemy oracledb
> ```
####
## Step 2: Set up Connection in Python
>
> - Open up an existing Jupyter Notebook, Python Script via VSC, or Launch a New Jupyter Notebook.
>> To launch a new jupyter notebook:
>> - Open up Anaconda
>> - Click `Launch Jupyter`
>> - Navigate to the appropriate directory
>> - Click `New`
>
> - Set up Connection by copy-pasting and running the following python code
>> <mark>Note: update user and password as needed.</mark>
> ```python
> from sqlalchemy import create_engine
> 
> USER = "ud_tableau_read"
> PWD  = "<password from Keeper>"  # paste the password from Keeper between the quotes
> HOST = "ban-odsp-db1.servers.udayton.edu"
> PORT = 1521
> SERVICE = "odsp.servers.udayton.edu"
> 
> engine = create_engine(
>    f"oracle+oracledb://{USER}:{PWD}@{HOST}:{PORT}/?service_name={SERVICE}"
> )
> ```
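
Hard-coding a password in a notebook makes it easy to leak, and special characters such as `@`, `#`, or `/` in a password can break the connection URL. The sketch below shows one way to handle both with only the standard library: read the password from an environment variable and percent-encode it before building the URL. The variable name `ODS_PASSWORD` and the helper `build_ods_url` are assumptions for illustration; the user, host, port, and service values are the ones used above.

```python
import os
from urllib.parse import quote_plus


def build_ods_url(user, password, host, port, service):
    """Build an oracle+oracledb URL with the credentials percent-encoded."""
    return (
        f"oracle+oracledb://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/?service_name={service}"
    )


if __name__ == "__main__":
    # ODS_PASSWORD is a hypothetical variable name; set it in your shell first.
    password = os.environ.get("ODS_PASSWORD", "")
    url = build_ods_url(
        "ud_tableau_read",
        password,
        "ban-odsp-db1.servers.udayton.edu",
        1521,
        "odsp.servers.udayton.edu",
    )
    # Avoid printing the URL itself, since it contains the password.
    print("Connection URL ready; pass it to create_engine(url).")
```

The resulting string can be passed to `create_engine` in place of the f-string above.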
####
## Step 3: Write a SQL Command
>
> - SQL in Python looks like
> ```python
> sql = """
> SELECT <...>
> FROM <schema>.<table>
> <...>
> """
> ```
>
> - Execute SQL command with the engine via pandas
> ```python
> import pandas as pd  # pandas runs the query and returns a DataFrame
> df = pd.read_sql(sql, engine)
> ```
>####
> Examples
>
> - Load UD_STUDENT_COURSE
> ```python
> sql = """
> SELECT *
> FROM UD_STUDENT.UD_STUDENT_COURSE
> """
>
> df_enrl = pd.read_sql(sql, engine)
> ```
>####
>
> - List all Schemas
> ```python
> sql = """
> SELECT DISTINCT owner
> FROM all_tab_columns
> ORDER BY owner
> """
> df_schemas = pd.read_sql(sql, engine)
> df_schemas
> ```
>####
>
> - List all tables in the UD_STUDENT schema
> ```python
> sql = """
> SELECT DISTINCT table_name
> FROM all_tab_columns
> WHERE owner = 'UD_STUDENT'
> ORDER BY table_name
> """
> df_tables = pd.read_sql(sql, engine)
> df_tables
> ```
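
The table-listing query above can be reused for other schemas by building the SQL text in a small function. The sketch below is one possible way to do that; `tables_in_schema_sql` is a hypothetical helper, and the name check is a simple safeguard so that arbitrary text is never pasted into the query.

```python
import re

# Unquoted Oracle identifiers: start with a letter, then letters, digits, _, $, #.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def tables_in_schema_sql(owner: str) -> str:
    """Return SQL that lists the distinct table names owned by one schema."""
    if not _IDENTIFIER.fullmatch(owner):
        raise ValueError(f"not a valid schema name: {owner!r}")
    return (
        "SELECT DISTINCT table_name\n"
        "FROM all_tab_columns\n"
        f"WHERE owner = '{owner.upper()}'\n"
        "ORDER BY table_name"
    )


if __name__ == "__main__":
    print(tables_in_schema_sql("ud_student"))
```

With the engine created earlier, `pd.read_sql(tables_in_schema_sql("UD_STUDENT"), engine)` would then return the table list as a DataFrame.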

