# Learn.co CleanUp

There are two reasons to do this.

**A)** You want a nice `github` repo where you can quickly look through all your lessons. Who doesn't want that?

**B)** Your local file structure is a mess. You saved some stuff over here. You saved some stuff over there. Then you went in a new direction with a naming scheme.

`git` has its own idea of nested file structure. It uses _**submodules**_ to reference different repos. The following code automates the process of adding all the lessons you have done to a single `github` repo. 

A side effect of adding a `submodule` to a repo is that `git` clones that repo to your local machine again, creating a set of redundant files. Once you are done with this process, you can choose which files to keep locally. If you want to keep the organizational structure introduced here, feel free. Otherwise, just delete the recently cloned files.

___

## 1) First make a brand new repo where all this stuff will live.

The first step is a new repo that will act as a table of contents for all of the Learn.co lessons.

Log in to GitHub and locate the <strong> + </strong> icon in the upper right corner.

<img src = imgs/git_new_repo_click.png style="height: 400px; width:1400px; resize:both">

<br>

Now title your repo as you wish. Make sure to check the `README` box before finalizing your choice.

<img src = imgs/git_new_repo_page.png style="height:500px">

<br>

Now clone this repo to your machine.

Finally, copy this notebook into the new repo. All the commands assume the notebook lives there.

## 2) Download the HTML of Learn.co 

While logged in, go to the main landing page of Learn.co and download the HTML of that page. Save the file in the repo we created in Step 1.

<img src=imgs/html_save.png>

<br>

Congratulations, the hard work is done.

## 3) Run Some Cells

In [1]:
from bs4 import BeautifulSoup
import re

In [2]:
with open('Learn - Data Science Career v1.1.html','r') as f:  #Make sure this file matches the name of HTML file you just saved.
    html = f.read()
soup = BeautifulSoup(html, "html.parser")
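If you are not sure what name the browser gave the saved page, a small helper can list the HTML files in the repo folder. This is only a sketch: `find_saved_html` is a hypothetical name, and the temporary folder stands in for your repo so the example runs anywhere.

```python
import tempfile
from pathlib import Path

def find_saved_html(folder='.'):
    """Return the .html files in `folder`, sorted by file name."""
    return sorted(Path(folder).glob('*.html'))

# Demo in a throwaway folder (an assumption; point it at your repo instead).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'Learn - Data Science Career v1.1.html').write_text('<html></html>')
    found = [p.name for p in find_saved_html(tmp)]

print(found)
```

Calling `find_saved_html()` with no argument searches the current working directory, which is the repo when the notebook lives there.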

In [3]:
chunk = str(soup('script',{'type' : 'text/javascript'}))
#chunk

The information we want is in a JavaScript block. Uncomment the last line of the cell above to see it.  
Instead of bringing in any more libraries, we will tackle this problem with our `regex` know-how. 

### RegEx

In [4]:
pattern = re.compile(r"learn-co-curriculum/d.*?(?=\")") #make a pattern object
repo_names = pattern.findall(chunk)

Uncomment the cell below to see what we collected.

In [10]:
#repo_names[:10]

['learn-co-curriculum/dsc-0-00-01',
 'learn-co-curriculum/dsc-0-00-04-blogging',
 'learn-co-curriculum/dsc-1-01-02-introduction-summary',
 'learn-co-curriculum/dsc-1-01-03-problems-ds-can-solve',
 'learn-co-curriculum/dsc-1-01-04-the-data-science-process',
 'learn-co-curriculum/dsc-1-01-05-setting-up-environment',
 'learn-co-curriculum/dsc-1-01-06-working-with-lessons-on-learn',
 'learn-co-curriculum/dsc-1-01-07-working-with-lessons-on-learn-lab',
 'learn-co-curriculum/dsc-1-01-08-your-first-data-science-codealong',
 'learn-co-curriculum/dsc-1-01-09-variable-assignment']
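To see how the pattern in the RegEx cell behaves, here is a minimal sketch that runs it on a made-up string. The `sample` text is an assumption standing in for the real JavaScript block. The lazy `.*?` grows one character at a time, and the lookahead `(?=")` stops it just before the next double quote without consuming it.

```python
import re

# Hypothetical stand-in for the JavaScript block on the saved page.
sample = (
    '{"repo":"learn-co-curriculum/dsc-1-01-09-variable-assignment",'
    '"other":"learn-co-curriculum/dsc-2-13-01-intro"}'
)

# Same pattern as the cell above, written with single quotes to avoid escaping.
pattern = re.compile(r'learn-co-curriculum/d.*?(?=")')
matches = pattern.findall(sample)

# Drop repeats while keeping the original order, in case a lesson appears twice.
unique_matches = list(dict.fromkeys(matches))
print(unique_matches)
```

A greedy `.*` would instead run to the last double quote on the line and swallow both names in a single match.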

## Personalize the URLs

In [9]:
cohort = 'online-ds-pt-100118' #Enter your cohort as a string
github = 'https://github.com/Socjon/' #Enter your github URL. Be sure to include a trailing slash

In [10]:
mod_full_urls = {'mod_1': [],
                 'mod_2': [],
                 'mod_3': [],
                 'mod_4': [],}

for name in repo_names:
    name = name.replace('learn-co-curriculum/', '', 1)          #Drop the org prefix; lstrip() strips a set of characters, not a prefix
    
    if name.startswith('dsc-1-') or name.startswith('dsc-00-') or name.startswith('dsc-01-'):
        mod_full_urls['mod_1'].append(github + name + '-' + cohort)
    
    elif name.startswith('dsc-2-'):
        mod_full_urls['mod_2'].append(github + name + '-' + cohort)
        
    elif name.startswith('dsc-3-'):
        mod_full_urls['mod_3'].append(github + name + '-' + cohort)
        
    elif name.startswith('dsc-4-') or name.startswith('dsc-04-'):
        mod_full_urls['mod_4'].append(github + name + '-' + cohort)
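The same bucketing can be written as a lookup table, which may be easier to extend when new prefixes show up. This is a sketch, not a replacement for the cell above: `MODULE_PREFIXES`, `module_for` and `fork_url` are hypothetical names, and the prefixes are copied from the `if`/`elif` chain.

```python
# Prefixes copied from the cell above; the table layout is an assumption.
MODULE_PREFIXES = {
    'mod_1': ('dsc-1-', 'dsc-00-', 'dsc-01-'),
    'mod_2': ('dsc-2-',),
    'mod_3': ('dsc-3-',),
    'mod_4': ('dsc-4-', 'dsc-04-'),
}

def module_for(name):
    """Return the module key for a lesson name, or None if no prefix matches."""
    for key, prefixes in MODULE_PREFIXES.items():
        # str.startswith accepts a tuple and checks each prefix in turn.
        if name.startswith(prefixes):
            return key
    return None

def fork_url(name, github, cohort):
    """Build the URL of your fork: account URL + lesson name + '-' + cohort."""
    return github + name + '-' + cohort

lesson = 'dsc-1-01-09-variable-assignment'
example = fork_url(lesson, 'https://github.com/Socjon/', 'online-ds-pt-100118')
print(module_for(lesson), example)
```

Names that match no prefix, such as `dsc-0-00-01`, return `None`, which mirrors how the cell above silently skips them.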

# Mark and David

The list below is truncated. I want to make sure this process works on other machines before I put work into other features.

The later cells already iterate over this smaller list. I plan to divide the lessons by section later on.

Inspect the output below and make sure the URLs look right.

In [21]:
trial = mod_full_urls['mod_1'][:10]
trial

['https://github.com/Socjon/dsc-1-01-02-introduction-summary-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-03-problems-ds-can-solve-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-04-the-data-science-process-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-05-setting-up-environment-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-06-working-with-lessons-on-learn-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-07-working-with-lessons-on-learn-lab-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-08-your-first-data-science-codealong-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-09-variable-assignment-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-10-strings-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-11-strings-lab-online-ds-pt-100118']

# Interacting with the OS

___
___

In [22]:
import subprocess, shlex

to_fork = []
mod_name = 'Module_1'                                           #Enter the Module you want to clean up first.
mod_dict = mod_full_urls['mod_1']                               #Enter the correct key for the above module




###!!!! THIS IS WHERE ISSUES MAY OCCUR DEPENDING ON OS TYPES!!!!
### This is the main area of focus, making sure this block works
### Thank you again for your help. Let me know if you run into issues


subprocess.run(f'mkdir {mod_name}', shell=True)                 #run() waits until the folder exists; don't use (shell=True) lightly
                                                                #https://stackoverflow.com/questions/3172470/actual-meaning-of-shell-true-in-subprocess
                                                                



for url in trial:                                               ### trial list; swap in mod_dict for the full run
    command = f'git submodule add {url}'                        #Add each URL in the list as a submodule.
    kwargs = {}
    kwargs['stdout'] = subprocess.PIPE
    kwargs['stderr'] = subprocess.PIPE
    proc = subprocess.Popen(shlex.split(command), **kwargs, cwd = f'{mod_name}')
    (stdout_str, stderr_str) = proc.communicate()
    return_code = proc.wait()
    #print (stdout_str)
    #print (stderr_str)


    to_check = stderr_str.decode('utf-8')                     #Convert the terminal output from bytes to str
    print(to_check)                                           #Prints status updates --optional
    pattern = 'fatal'                                         #git reports nonexistent URLs with a 'fatal' message
    if pattern in to_check:                                   #Membership test also catches a match at the very start
        to_fork.append(url)
            
commands = ["git commit -m 'adding a submodule'", "git push"]    #Pushing all the changes
for command in commands:
    subprocess.run(shlex.split(command))                         #run() waits, so the commit finishes before the push starts

print('Go create the following')
print(to_fork)

'Module_1/dsc-1-01-02-introduction-summary-online-ds-pt-100118' already exists in the index

Cloning into 'C:/Users/J/DS/Flatiron/Module_1/dsc-1-01-03-problems-ds-can-solve-online-ds-pt-100118'...
remote: Repository not found.
fatal: repository 'https://github.com/Socjon/dsc-1-01-03-problems-ds-can-solve-online-ds-pt-100118/' not found
fatal: clone of 'https://github.com/Socjon/dsc-1-01-03-problems-ds-can-solve-online-ds-pt-100118' into submodule path 'C:/Users/J/DS/Flatiron/Module_1/dsc-1-01-03-problems-ds-can-solve-online-ds-pt-100118' failed

'Module_1/dsc-1-01-04-the-data-science-process-online-ds-pt-100118' already exists in the index

'Module_1/dsc-1-01-05-setting-up-environment-online-ds-pt-100118' already exists in the index

'Module_1/dsc-1-01-06-working-with-lessons-on-learn-online-ds-pt-100118' already exists in the index

'Module_1/dsc-1-01-07-working-with-lessons-on-learn-lab-online-ds-pt-100118' already exists in the index

Cloning into 'C:/Users/J/DS/Flatiron/Module_1/ds
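The capture-and-check logic from the cell above can be tried without touching a real repo. In this sketch, `run_and_capture` and `looks_missing` are hypothetical helper names, and a short Python one-liner plays the part of a failing `git` call; the exit code 128 is chosen only for illustration.

```python
import subprocess
import sys

def run_and_capture(args, cwd=None):
    """Run a command, wait for it to finish, and return (return_code, stderr_text)."""
    result = subprocess.run(args, cwd=cwd, capture_output=True, text=True)
    return result.returncode, result.stderr

def looks_missing(stderr_text):
    """True when the output contains a 'fatal' message, e.g. repository not found."""
    return 'fatal' in stderr_text

# Stand-in for a failing git call so the sketch runs anywhere (an assumption).
fake_git = [
    sys.executable, '-c',
    "import sys; sys.stderr.write('fatal: repository not found'); sys.exit(128)",
]
code, err = run_and_capture(fake_git)
print(code, looks_missing(err))
```

With `text=True` the output arrives as `str`, so no separate decode step is needed. The "already exists in the index" messages shown above do not contain `fatal`, so those lessons are not added to the list of repos to fork.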

## Helper Function
After running the above cell, there may be some Learn lessons that you didn't fork: Section Recaps, Introductions, etc.

I have included a helper function that opens all the URLs that still need to be forked. Once they have been forked, just rerun the above cell to add them as `submodules`.
   

In [23]:
#to_fork

['https://github.com/Socjon/dsc-1-01-03-problems-ds-can-solve-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-08-your-first-data-science-codealong-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-09-variable-assignment-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-10-strings-online-ds-pt-100118',
 'https://github.com/Socjon/dsc-1-01-11-strings-lab-online-ds-pt-100118']

In [15]:
import webbrowser 
learn = 'https://github.com/learn-co-students/'

In [24]:
def let_there_be_tabs(url_list):
    learn = 'https://github.com/learn-co-students/'
    for url in url_list:
        webbrowser.open(url.replace(github, learn))

In [25]:
#let_there_be_tabs(to_fork)
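Before opening a batch of browser tabs, it can help to preview the URLs the helper would visit. The sketch below only builds the list and opens nothing; `student_urls` is a hypothetical name, and the account URL is the example one used earlier in this notebook.

```python
def student_urls(url_list, github, learn='https://github.com/learn-co-students/'):
    """Swap your account prefix for the learn-co-students prefix in each URL."""
    return [url.replace(github, learn) for url in url_list]

# Example fork URL taken from the to_fork output above.
preview = student_urls(
    ['https://github.com/Socjon/dsc-1-01-10-strings-online-ds-pt-100118'],
    github='https://github.com/Socjon/',
)
print(preview)
```

If the previewed URLs look right, passing the same list to the helper above opens each one in a tab.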