In [90]:
import requests

In [3]:
response = requests.get("https://carl-mcbride-ellis.github.io/TOBoML/TOBoML.pdf")

In [83]:
response = requests.get("https://tanthiamhuat.files.wordpress.com/2018/03/deeplearningwithpython.pdf")

In [84]:
with open("deeplearningwithpython.pdf", "wb") as pdf_file:
    pdf_file.write(response.content)
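
The cells above write `response.content` to disk without checking that the download returned a PDF. A minimal sketch of a guard, assuming the payload should begin with the usual `%PDF-` header; `save_pdf` is a hypothetical helper, not part of `requests`:

```python
from pathlib import Path

# Standard PDF files begin with this header.
PDF_MAGIC = b"%PDF-"


def save_pdf(data: bytes, path: str) -> Path:
    """Write data to path only if it looks like a PDF (hypothetical helper)."""
    if not data.startswith(PDF_MAGIC):
        # An HTML error page or an empty body would end up here.
        raise ValueError(f"Payload does not start with {PDF_MAGIC!r}; not writing {path}")
    target = Path(path)
    target.write_bytes(data)
    return target
```

In the notebook this would be called as `save_pdf(response.content, "deeplearningwithpython.pdf")`, ideally after `response.raise_for_status()` so that HTTP errors surface before anything is written.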

In [87]:
import pymupdf  # also imported in a later cell; repeated here so the cells run top to bottom

doc = pymupdf.open("deeplearningwithpython.pdf")

In [88]:
content = ""

for page in doc:
    content += page.get_text()

In [89]:
len(content.split(" "))

96904
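
The count above comes from `split(" ")`, which separates only on single spaces. Words joined by a newline are counted as one, and consecutive spaces produce empty strings, so the figure is a rough estimate. A small sketch of a whitespace-aware count for comparison; `count_words` is a hypothetical helper name:

```python
def count_words(text: str) -> int:
    """Count whitespace-delimited words; newlines and tabs also act as separators."""
    return len(text.split())


sample = "deep learning\nwith\npython"
print(len(sample.split(" ")))  # 2: only the single space separates anything
print(count_words(sample))     # 4: newlines are treated as separators too
```

Text extracted from PDF pages tends to contain many newlines, so the two counts can differ noticeably on real documents.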

In [4]:
with open("TOBoML.pdf", "wb") as pdf_file:
    pdf_file.write(response.content)

In [63]:
import pymupdf

In [64]:
doc = pymupdf.open("TOBoML.pdf")

In [13]:
content = ""

for page in doc[:15]:
    content += page.get_text()

In [85]:
content = ""

for page in doc:
    content += page.get_text()

In [86]:
len(content.split(" "))

27325

In [69]:
with open("test_content_large.txt", "w") as text_file:
    text_file.write(content)

In [70]:
from langchain_openai import ChatOpenAI

from dsview.content_loader import TextLoader
from dsview.content_extraction import ContentExtractor

In [71]:
llm = ChatOpenAI(temperature=0, model_name="gpt-4o-mini")


In [72]:
content_extractor = ContentExtractor(llm)
content_loader = TextLoader("test_content_large.txt")

In [5]:
summary, topics, links = content_extractor.extract_content(content_loader)

**Summary of "The Orange Book of Machine Learning" by Carl McBride Ellis**

**Theme and Context:**
"The Orange Book of Machine Learning" serves as a comprehensive guide to the fundamentals of supervised machine learning, focusing on regression and classification techniques for tabular data. The book is structured to provide both theoretical insights and practical applications, primarily using Python and the scikit-learn library. It is based on a five-day course taught across various Spanish cities, aiming to equip readers with essential skills in data science.

**Key Concepts and Takeaways:**

1. **Introduction to Machine Learning:**
   - The book emphasizes the distinction between statistical models (for explanation) and machine learning models (for prediction). It introduces key concepts such as interpolation, curve fitting, errors vs. residuals, and sources of uncertainty (aleatoric and epistemic).

2. **Statistical Foundations:**
   - Essential statistical concepts are covered, including measures of central tendency (mean, median, mode), dispersion (variance, MAD, quartiles), and the Gaussian distribution. These concepts are crucial for understanding data and machine learning estimators.

3. **Exploratory Data Analysis (EDA):**
   - The book discusses the importance of data quality, descriptive statistics, and visualization techniques (e.g., box plots, scatter plots) to understand data distributions and relationships.

4. **Data Cleaning:**
   - Techniques for handling missing values, outliers, and duplicated rows are outlined, along with methods for feature scaling and dealing with categorical features.

5. **Model Evaluation:**
   - The text covers cross-validation techniques, data leakage, and the importance of understanding covariate shift and concept drift in model performance.

6. **Regression Techniques:**
   - Various regression methods are discussed, including linear regression, polynomial regression, and decision tree regression. The book also addresses overfitting and introduces metrics for evaluating model performance (e.g., RMSE, MAE).

7. **Classification Techniques:**
   - The book explores logistic regression, decision trees, and classification metrics, emphasizing the challenges of imbalanced datasets and the importance of classifier calibration.

8. **Ensemble Methods:**
   - Techniques such as Random Forest, boosting, and gradient-boosted decision trees are introduced, highlighting their effectiveness in improving model accuracy.

9. **Hyperparameter Optimization:**
   - The importance of tuning model parameters to enhance performance is discussed, along with strategies for effective optimization.

10. **Feature Engineering and Selection:**
    - The book emphasizes the significance of feature engineering and selection techniques, including PCA and various statistical methods for identifying important features.

11. **Interpretability and Explainability:**
    - The text discusses the need for model interpretability, especially in light of regulations like GDPR, and introduces tools like SHAP and LIME for enhancing model explainability.

12. **Exclusion of Neural Networks:**
    - The author explains the decision to focus on traditional machine learning methods rather than deep learning, providing insights into simpler models that can be more interpretable.

Overall, "The Orange Book of Machine Learning" is a practical resource for data scientists and machine learning practitioners, offering a solid foundation in essential concepts and techniques for making predictions with tabular data.

In [73]:
summary, topics, links = content_extractor.extract_content(content_loader)

**Summary of "The Orange Book of Machine Learning" by Carl McBride Ellis**

**Theme and Context:**
The book serves as a comprehensive guide to machine learning, focusing on supervised regression and classification techniques for tabular data. It emphasizes practical applications using Python, particularly with libraries like scikit-learn and pandas, and is structured to facilitate understanding through a blend of theory and hands-on examples.

**Key Concepts and Takeaways:**

1. **Introduction to Machine Learning:**
   - The book begins with foundational concepts such as the distinction between statistical models (explanatory) and machine learning models (predictive).
   - It discusses the importance of understanding errors, residuals, and sources of uncertainty (aleatoric and epistemic).

2. **Statistics and Data Analysis:**
   - Essential statistical concepts are covered, including measures of central tendency (mean, median, mode), dispersion (variance, standard deviation), and distributions (Gaussian, Chebyshev's inequality).
   - Exploratory Data Analysis (EDA) techniques are introduced to assess data quality and inform feature engineering.

3. **Data Cleaning:**
   - The book outlines methods for handling missing values, outliers, and duplicates, emphasizing the importance of data quality in machine learning.

4. **Model Evaluation:**
   - Cross-validation techniques are discussed to ensure models generalize well to unseen data, along with metrics for assessing model performance (e.g., RMSE, MAE, R²).

5. **Regression Techniques:**
   - Various regression methods are explored, including linear regression, polynomial regression, and decision tree regression, with a focus on understanding model assumptions and performance metrics.

6. **Classification Techniques:**
   - The book covers logistic regression, decision trees, and ensemble methods like Random Forest and boosting techniques (e.g., AdaBoost, Gradient Boosting).
   - It emphasizes the importance of metrics such as precision, recall, F1 score, and AUC-ROC for evaluating classification models.

7. **Ensemble Methods:**
   - The advantages of combining multiple models to improve predictive performance are discussed, including techniques like stacking and convex combination of predictions.

8. **Hyperparameter Optimization:**
   - Strategies for tuning model hyperparameters using grid search, randomized search, and other methods are presented to enhance model performance.

9. **Feature Engineering and Selection:**
   - The significance of creating new features and selecting the most relevant ones is highlighted, with techniques such as permutation importance, LASSO, and PCA.

10. **Neural Networks and Deep Learning:**
    - The book concludes with a critical view of neural networks for tabular data, suggesting that traditional models often outperform them in this context.

**Conclusion:**
"The Orange Book of Machine Learning" is a practical resource for data scientists and machine learning practitioners, providing a solid foundation in both the theoretical and practical aspects of machine learning. It emphasizes the importance of data quality, model evaluation, and the thoughtful application of various algorithms to achieve robust predictive performance.

In [11]:
topics.tags

[DataScienceTag(name='Machine Learning'),
 DataScienceTag(name='Statistics'),
 DataScienceTag(name='Data Engineering'),
 DataScienceTag(name='Data Visualization')]

In [12]:
topics.subjects

[DataScienceSubject(type='Organization', name='Carl McBride Ellis', description='Author and educator in machine learning.'),
 DataScienceSubject(type='Product', name='The Orange Book of Machine Learning', description='A comprehensive guide on supervised regression and classification for tabular data.'),
 DataScienceSubject(type='Python library', name='scikit-learn', description='A machine learning library for Python used extensively in the book.'),
 DataScienceSubject(type='Python library', name='pandas', description='A data manipulation library for Python used in conjunction with scikit-learn.')]

In [14]:
from langchain_core.prompts import ChatPromptTemplate

In [75]:
large_pdf_summarization_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are part of a knowledge management system. Your role is to extract information about Data Science news, articles and courses.
    The source is a PDF document, for example a technical report, a course or a scientific article. The entire document is too large to
    be transmitted in one go, so you will only receive the first 15 pages. Your task is to write a concise summary of the
    themes and concepts discussed in the document. If a conclusion is stated early on you may mention it, but given the
    limited context you are provided, be careful not to infer anything that is not written.
    """),
    ("user", "Here is an extract of a PDF document: {content}")
])


In [76]:
large_summary_generator = large_pdf_summarization_prompt | llm

In [79]:
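# The content is repeated three times here, which pushes the request past the model's context window (see the error below).
summary = large_summary_generator.invoke({"content": content_loader.content * 3})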

BadRequestError: Error code: 400 - {'error': {'message': "This model's maximum context length is 128000 tokens. However, your messages resulted in 143914 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
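
One way to avoid this error is to cut the text down to an estimated token budget before invoking the chain. The sketch below is an approximation under stated assumptions, not an exact tokenizer. The 128,000-token limit is the one quoted in the error message. The ratio of 1.8 tokens per space-delimited word is a rough guess based on figures reported elsewhere in this notebook (about 48,000 prompt tokens for about 27,000 words). `RESERVED_TOKENS` and `truncate_to_token_budget` are hypothetical names.

```python
MAX_CONTEXT_TOKENS = 128_000  # limit quoted in the error message above
RESERVED_TOKENS = 8_000       # assumed headroom for the system prompt and the answer


def truncate_to_token_budget(text: str, max_tokens: int, tokens_per_word: float = 1.8) -> str:
    """Keep the leading part of text whose estimated token count fits max_tokens.

    The estimate is words * tokens_per_word, with words split on single
    spaces to match the word counts used earlier in the notebook.
    """
    if max_tokens <= 0:
        return ""
    max_words = int(max_tokens / tokens_per_word)
    words = text.split(" ")
    if len(words) <= max_words:
        return text  # already within the estimated budget
    return " ".join(words[:max_words])


budget = MAX_CONTEXT_TOKENS - RESERVED_TOKENS
```

The truncated text could then be passed as `{"content": truncate_to_token_budget(content_loader.content * 3, budget)}`. Because the ratio is only an estimate, a real tokenizer would be needed for exact counts.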

In [78]:
summary.response_metadata

{'token_usage': {'completion_tokens': 589,
  'prompt_tokens': 48076,
  'total_tokens': 48665},
 'model_name': 'gpt-4o-mini-2024-07-18',
 'system_fingerprint': 'fp_48196bc67a',
 'finish_reason': 'stop',
 'logprobs': None}

In [61]:
# TODO implement monitoring on response metadata
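
A possible starting point for this TODO is sketched below. It assumes `response_metadata` is a plain dict shaped like the outputs shown in this notebook, with a `token_usage` entry and a `finish_reason`. `UsageMonitor` is a hypothetical helper, not part of LangChain or `dsview`.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm_usage")


@dataclass
class UsageMonitor:
    """Accumulates token counts across LLM calls (hypothetical helper)."""

    prompt_tokens: int = 0
    completion_tokens: int = 0
    calls: int = 0

    def record(self, response_metadata: dict) -> None:
        """Add one response's token usage and flag unexpected finish reasons."""
        usage = response_metadata.get("token_usage", {})
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.calls += 1
        finish_reason = response_metadata.get("finish_reason")
        if finish_reason != "stop":
            # For example, a response cut off by a length limit.
            logger.warning("Unexpected finish_reason: %s", finish_reason)

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens
```

After each call, `monitor.record(summary.response_metadata)` would update the running totals, which can then be logged or compared against a budget.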

In [54]:
summary.response_metadata

{'token_usage': {'completion_tokens': 600,
  'prompt_tokens': 10023,
  'total_tokens': 10623},
 'model_name': 'gpt-4o-mini-2024-07-18',
 'system_fingerprint': 'fp_f3db212e1c',
 'finish_reason': 'stop',
 'logprobs': None}

The document titled "The Orange Book of Machine Learning" by Carl McBride Ellis serves as an introductory guide to essential concepts in supervised regression and classification for tabular data. The book is structured into several chapters, each focusing on different aspects of machine learning and data analysis.

### Key Themes and Concepts:

1. **Introduction to Machine Learning**:
   - The book emphasizes the importance of data in making predictions and introduces fundamental concepts such as interpolation, curve fitting, errors, and residuals.
   - It distinguishes between statistical models (focused on explanation) and machine learning models (focused on prediction).

2. **Uncertainty in Predictions**:
   - The text discusses two types of uncertainty: aleatoric (inherent randomness) and epistemic (lack of data).
   - It also covers confidence and prediction intervals, which quantify uncertainty in model parameters and predictions, respectively.

3. **Explainability and Interpretability**:
   - The book defines explainability as understanding both the 'how' and 'why' of predictions, while interpretability focuses on the 'why'.
   - It highlights the challenges of interpretability in machine learning models and mentions tools like SHAP and LIME for enhancing model transparency.

4. **Statistical Foundations**:
   - The document covers essential statistical concepts such as measures of central tendency (mean, median, mode), dispersion (variance, MAD), and the Gaussian distribution.
   - It introduces the importance of understanding data distributions and their implications for machine learning.

5. **Exploratory Data Analysis (EDA)**:
   - EDA techniques are discussed, including data quality assessment, descriptive statistics, and visualization methods (e.g., box plots, scatter plots).
   - The concept of the curse of dimensionality and the significance of outliers and correlations are also addressed.

6. **Data Cleaning and Preparation**:
   - The book outlines methods for handling missing values, outliers, and duplicated rows, as well as feature scaling and encoding categorical features.

7. **Model Evaluation and Validation**:
   - It introduces cross-validation techniques, including train-test splits and nested cross-validation, to ensure robust model evaluation.
   - The document also discusses issues like data leakage and concept drift.

8. **Regression and Classification Techniques**:
   - The book provides insights into various regression techniques (e.g., linear regression, polynomial regression) and classification methods (e.g., logistic regression, decision trees).
   - It covers model metrics for evaluating performance, such as RMSE, MAE, and accuracy scores.

9. **Ensemble Methods and Hyperparameter Optimization**:
   - The text discusses ensemble methods like Random Forest and boosting techniques, emphasizing their effectiveness in improving model performance.
   - Hyperparameter optimization strategies are also introduced to enhance model tuning.

10. **Feature Engineering and Selection**:
    - The importance of feature engineering and selection is highlighted, including techniques like PCA and LASSO for dimensionality reduction and feature importance assessment.

11. **Exclusion of Neural Networks**:
    - The author notes a deliberate choice to exclude deep learning and neural networks from the discussion, focusing instead on traditional machine learning methods.

### Conclusion:
The document serves as a comprehensive resource for understanding the foundational aspects of machine learning, particularly in the context of supervised learning with tabular data. It emphasizes the balance between predictive performance and model interpretability, providing practical insights and techniques for data scientists and practitioners.