## Document Loaders Using LangChain for Unstructured Data
- Reading PDF files
- Downloading the audio of a YouTube video from its URL and transcribing it to text
- Reading websites


In [1]:
import os
import datetime
from dotenv import load_dotenv, find_dotenv
import openai

In [2]:
_ = load_dotenv(find_dotenv())

In [3]:
openai.api_key = os.environ['OPENAI_API_KEY']
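
Indexing `os.environ` directly raises a bare `KeyError` when the key is missing. The following is a minimal sketch of a friendlier lookup; the helper name `require_env` is hypothetical and not part of the notebook or of any library.

```python
import os

def require_env(name, env=None):
    """Return the value of an environment variable or raise a descriptive error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value

# Possible use in the notebook, assuming load_dotenv has already run:
# openai.api_key = require_env("OPENAI_API_KEY")
```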

In [4]:
from langchain.document_loaders import PyPDFLoader

In [5]:
%pip install pypdf

Note: you may need to restart the kernel to use updated packages.


In [6]:
loader = PyPDFLoader("docs/MachineLearning-Lecture01.pdf")
pages = loader.load()
print(len(pages))

22


In [7]:
pages[0].page_content

'MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay. Good morning. Welcome to CS229, the machine \nlearning class. So what I wanna do today is ju st spend a little time going over the logistics \nof the class, and then we\'ll start to  talk a bit about machine learning.  \nBy way of introduction, my name\'s  Andrew Ng and I\'ll be instru ctor for this class. And so \nI personally work in machine learning, and I\' ve worked on it for about 15 years now, and \nI actually think that machine learning is th e most exciting field of all the computer \nsciences. So I\'m actually always excited about  teaching this class. Sometimes I actually \nthink that machine learning is not only the most exciting thin g in computer science, but \nthe most exciting thing in all of human e ndeavor, so maybe a little bias there.  \nI also want to introduce the TAs, who are all graduate students doing research in or \nrelated to the machine learni ng and all aspects of machin e learning. Paul Baumstar

In [8]:
pages[0].metadata

{'source': 'docs/MachineLearning-Lecture01.pdf', 'page': 0}
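
Each loaded page exposes `page_content` and `metadata`, as the two cells above show. A quick way to scan all pages at once might look like the sketch below. `Page` is a hypothetical stand-in used only so the example runs without the PDF; in the notebook you would pass `pages` instead of `sample`.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    """Hypothetical stand-in with the two attributes the notebook inspects."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def preview_pages(pages, width=60):
    """Return one 'source p.N: first characters' line per page."""
    lines = []
    for page in pages:
        # Collapse newlines and repeated spaces left over from PDF extraction.
        text = " ".join(page.page_content.split())
        source = page.metadata.get("source", "?")
        number = page.metadata.get("page", "?")
        lines.append(f"{source} p.{number}: {text[:width]}")
    return lines

sample = [Page("MachineLearning-Lecture01  \nInstructor (Andrew Ng):  Okay.",
               {"source": "docs/MachineLearning-Lecture01.pdf", "page": 0})]
for line in preview_pages(sample):
    print(line)
```

The whitespace collapse only normalizes spacing; it does not repair words that the PDF extraction split in the middle (such as `ju st`).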

In [9]:
from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

In [10]:
#pip install --quiet --upgrade yt_dlp;

In [11]:
#pip install --quiet --upgrade pydub

In [12]:
#pip install --quiet --upgrade ffmpeg

In [13]:
#pip install --quiet --upgrade ffprobe
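
The audio download and conversion steps typically depend on the `ffmpeg` and `ffprobe` executables being on the system `PATH`, and the pip packages of the same names may not provide those binaries. As a hedged sketch, the check below uses only the standard library to report whether they can be found; the helper name `missing_tools` is hypothetical.

```python
import shutil

def missing_tools(tools=("ffmpeg", "ffprobe")):
    """Return the names of executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("ffmpeg and ffprobe are available.")
```

If either tool is reported missing, installing it through the operating system's package manager is usually the more reliable route.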

In [14]:
yt_url = "https://youtu.be/DlxiE-BMLRQ?si=96W5IrfQhxLCNqFj"
save_dir = "docs/youtube_converted_Audio"
loader = GenericLoader(
    YoutubeAudioLoader([yt_url], save_dir=save_dir),
    OpenAIWhisperParser()
)
files = loader.load()

[youtube] Extracting URL: https://youtu.be/DlxiE-BMLRQ?si=96W5IrfQhxLCNqFj
[youtube] DlxiE-BMLRQ: Downloading webpage
[youtube] DlxiE-BMLRQ: Downloading ios player API JSON


         Install PhantomJS to workaround the issue. Please download it from https://phantomjs.org/download.html
         n = eVXbJTmEFVLX7sPUX ; player = https://www.youtube.com/s/player/b22ef6e7/player_ias.vflset/en_US/base.js
         Install PhantomJS to workaround the issue. Please download it from https://phantomjs.org/download.html
         n = qjQFhiP2YAclL7pGt ; player = https://www.youtube.com/s/player/b22ef6e7/player_ias.vflset/en_US/base.js


[youtube] DlxiE-BMLRQ: Downloading m3u8 information
[info] DlxiE-BMLRQ: Downloading 1 format(s): 140
[download] 100% of   22.22MiB
Transcribing part 1!
Transcribing part 2!


In [15]:
print(len(files))

2
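
The loader returned two documents, which matches the two "Transcribing part" lines in the log. If a single transcript is more convenient, the parts could be joined as in this sketch. The `SimpleNamespace` objects are hypothetical stand-ins for the loaded documents, and the sample text is invented; in the notebook you would pass `files`.

```python
from types import SimpleNamespace

def join_transcript(docs):
    """Concatenate the page_content of each chunk, in order, into one string."""
    parts = [doc.page_content.strip() for doc in docs]
    return " ".join(part for part in parts if part)

# Hypothetical stand-ins for the two documents returned by loader.load().
chunks = [
    SimpleNamespace(page_content="First part of the talk. "),
    SimpleNamespace(page_content=" Second part of the talk."),
]
transcript = join_transcript(chunks)
print(len(transcript.split()), "words")
```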


In [16]:
print(files[0])



In [17]:
from langchain.document_loaders import WebBaseLoader

USER_AGENT environment variable not set, consider setting it to identify your requests.
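
The warning above can likely be avoided by setting `USER_AGENT` before the import runs. A minimal sketch, assuming the warning is emitted at import time as the output suggests; the identifier string is a hypothetical placeholder.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
os.environ.setdefault("USER_AGENT", "document-loader-notebook/0.1")

# The import would follow, so the variable is already set when the module loads:
# from langchain.document_loaders import WebBaseLoader
```

`setdefault` leaves any value already exported in the shell or loaded from `.env` untouched.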


In [18]:
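# Note: the output further down shows GitHub returning a "File not found" page
# for this URL, so the path or branch name should be checked before relying on it.
loader = WebBaseLoader("https://github.com/basecamp/handbook/blob/master/37signals-is-you.md")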

In [19]:
files = loader.load()

In [20]:
print(len(files))

1


In [22]:
files[0].page_content

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nFile not found · GitHub\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n\n\n\n\nNavigation Menu\n\nToggle navigation\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n            Sign in\n          \n\n\n\n\n\n\n\n\n        Product\n        \n\n\n\n\n\n\n\n\n\n\n\n\nActions\n        Automate any workflow\n      \n\n\n\n\n\n\n\nPackages\n        Host and manage packages\n      \n\n\n\n\n\n\n\nSecurity\n        Find and fix vulnerabilities\n      \n\n\n\n\n\n\n\nCodespaces\n        Instant dev environments\n      \n\n\n\n\n\n\n\nGitHub Copilot\n        Write better code with AI\n      \n\n\n\n\n\n\n\nCode review\n        Manage code changes\n      \n\n\n\n\n\n\n\nIssues\n        Plan and track work\n      \n\n\n\n\n\n\n\nDisc
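
The raw page text is mostly blank lines, and its first real line is "File not found · GitHub", which suggests the loader fetched an error page rather than the intended document. The sketch below tidies the whitespace and applies a simple heuristic for spotting such a page. Both helper names are hypothetical, and the check is a heuristic rather than a reliable status test.

```python
import re

def tidy_page_text(text):
    """Strip each line and drop empty ones, leaving one text fragment per line."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

def looks_like_missing_page(text):
    """Heuristic: treat a 'not found' title on the first non-empty line as an error page."""
    first_line = tidy_page_text(text).split("\n", 1)[0]
    return bool(re.search(r"file not found|page not found", first_line, re.IGNORECASE))

# Shortened sample modeled on the output above.
raw = "\n\n\nFile not found · GitHub\n\n\n\nSkip to content\n\n   Sign in\n  \n"
print(tidy_page_text(raw))
print(looks_like_missing_page(raw))
```

In the notebook, `tidy_page_text(files[0].page_content)` would give a far more readable view of whatever the loader actually fetched.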