# Introduction to R and RStudio
- R is a statistical programming language used for data processing, statistical inference, data analysis, and machine learning.
- R is widely used in academia, healthcare, and government.
- R supports importing data from various sources and is known for producing high-quality visualizations.

## RStudio Environment
- RStudio is an integrated development environment (IDE) for writing and running R code.
- RStudio includes:
    - Syntax-highlighting editor for direct code execution
    - Console for typing R commands
    - Workspace and History tab for tracking R objects and commands
    - Files, Plots, Packages, and Help tabs for managing files, plots, packages, and help resources

## R Libraries for Data Science
- Popular R libraries for data science include:
    - dplyr for data manipulation
    - stringr for string manipulation
    - ggplot2 for data visualization
    - caret for machine learning

## Data Visualization in R

- R offers several data visualization packages: ggplot2 (the most popular), plotly (web-based), lattice (complex data), and leaflet (interactive maps). Install any of them with `install.packages("package_name")`.
- R also has built-in plotting functions: you can create basic scatterplots and customize them with lines and titles.
- The ggplot2 library provides powerful visualizations: you build a graph by adding layers (such as points) and arguments (such as axis labels). For example, `geom_point()` adds a scatterplot layer.
- The GGally library extends ggplot2: it simplifies creating complex visualizations by combining geometric objects with transformed data.

# Git and GitHub
- Git and GitHub are popular tools for version control and collaboration among developers and data scientists.
- Version control allows tracking changes to files, recovering older versions, and easier collaboration.

## What is Git?
- Git is free and open-source distributed version control software.
- Distributed means users can have a copy of the project locally and sync changes to a remote server.
- Git is widely used for code but can also version control other file types like images and documents.

## Basic Git Terms
- SSH: Secure remote login protocol
- Repository: A project folder set up for version control
- Fork: Copy of a repository
- Pull Request: Request to review and approve changes
- Working Directory: Files associated with a Git repository on your computer

## Basic Git Commands
- git init: Create a new repository
- git add: Move changes from working directory to staging area
- git status: See state of working directory and staged changes
- git commit: Commit staged changes to the project
- git reset: Unstage changes or move the current branch back to an earlier commit (with `--hard`, also discard changes in the working directory)
- git log: Browse previous changes
- git branch: Create an isolated environment for changes
- git checkout: Switch between branches
- git merge: Merge branches together
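
The commands above can be tried end to end in a throwaway directory. The following is a minimal sketch, assuming Git is installed and on the `PATH`; the file name `notes.txt`, the branch name `feature`, and the demo identity are hypothetical values chosen only for illustration.

```shell
#!/bin/sh
set -eu

# Work in a temporary directory so nothing outside it is touched.
repo="$(mktemp -d)"
cd "$repo"

git init -q                          # create a new repository
# Demo-only identity (hypothetical), stored in this repository's config.
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "first line" > notes.txt
git add notes.txt                    # move the new file to the staging area
git status --short                   # inspect working directory and staged changes
git commit -q -m "Add notes"         # commit the staged changes

git checkout -q -b feature           # create a branch and switch to it
echo "second line" >> notes.txt
git add notes.txt
git reset -q -- notes.txt            # unstage; the edit stays in the working directory
git add notes.txt                    # stage it again
git commit -q -m "Extend notes"

git checkout -q -                    # switch back to the previous branch
git merge -q feature                 # merge the feature branch (a fast-forward here)
git log --oneline                    # browse the recorded commits
```

Using `git checkout -` to return to the previous branch avoids assuming whether the default branch is called `main` or `master`.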

## Learning Git
- GitHub provides resources such as cheat sheets and tutorials at [try.github.io](https://docs.github.com/en/get-started/getting-started-with-git) for learning essential Git commands.
- The course will provide a crash course on setting up a local environment and getting started with a project.

## Purpose of Source Repositories and GitHub
- Source repositories are used to manage and track changes to source code during software development.
- **GitHub** is an online hosting service that satisfies the needs of a source repository.

## History of Git
- Git was developed in 2005 by Linus Torvalds to replace BitKeeper for Linux development.
- Key characteristics of Git include support for **non-linear development**, **distributed development**, **compatibility with existing systems**, **efficient handling of large projects**, **cryptographic authentication**, and **pluggable merge strategies**.

## Git Repository Model
- Git is a distributed version control system primarily focused on **tracking source code changes**.
- It allows coordination among programmers, tracking changes, and supporting non-linear workflows.
- Each developer has a local copy of the full development history, and changes are copied between repositories.
- There is typically a main branch for deployable code, and separate branches for ongoing work.
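
The distributed model can be illustrated on a single machine. This is a sketch under stated assumptions: Git is installed, a local bare repository stands in for the remote server, and the directory names `remote.git`, `alice`, and `bob` are hypothetical.

```shell
#!/bin/sh
set -eu

work="$(mktemp -d)"
cd "$work"

# A bare repository plays the role of the shared remote server.
git init -q --bare remote.git

# Developer A clones the (still empty) remote, commits locally, then pushes.
git clone -q remote.git alice 2>/dev/null   # hide the "empty repository" warning
cd alice
git config user.name "Alice"                # demo-only identity (hypothetical)
git config user.email "alice@example.com"
echo "hello" > README.md
git add README.md
git commit -q -m "Add README"
git push -q origin HEAD                     # copy the local commit to the remote
cd ..

# Developer B clones afterwards and receives the full history.
git clone -q remote.git bob
git -C bob log --oneline
```

Each clone holds the complete history, so both developers can commit and inspect the log offline and only need the remote when exchanging changes.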

## GitHub
- GitHub is an **online hosting service for Git repositories**, owned by Microsoft.
- It offers free, professional, and enterprise accounts, with over 100 million repositories as of 2019.
- Repositories store documents like source code and enable version control and collaboration.

## GitLab
- GitLab is a complete DevOps platform that provides access to Git repositories and source code management.
- It allows developers to collaborate, review code, work from local copies, branch and merge code, and streamline testing and delivery with built-in CI/CD.


## GitHub Repository
### Repository Tabs
- Code: Contains all source files
- Issues: Track and plan issues/items against the project
- Pull Requests: Mechanism for collaborating, reviewing changes before merging
- Projects: Tools for project management and planning
- Wiki, Security, Insights: Project documentation, security settings and alerts, and repository activity analytics
- Settings: Personalization and access control

# Tools for Data Science


| Term |Definition|
| :-:|:--------------------------------------------------------------------------------|
|Apache MLlib| A library that makes machine learning scalable|
|Apache Spark| A general-purpose cluster-computing framework allowing you to process data using compute clusters|
|API| Application programming interface; allows communication between two pieces of software|
|Caffe| A deep learning algorithm repository built with C++, with Python and MATLAB bindings|
|CDLA| Community Data License Agreement|
|Classification models| Used to predict whether some information or data belongs to a category or class|
|CLI| Command line interface|
|C++| A general-purpose programming language. It is an extension of the C programming language, originally called "C with Classes"|
|Data set| A structured collection of data|
|Deeplearning4j| A Java library for deep learning|
|Deep learning| A specialized type of machine learning. It refers to a general set of models and techniques that loosely emulate the way the human brain solves a wide range of problems|
|ELT| Extract, Load, Transform| 
|ETL| Extract, Transform, and Load|
|FSF| Free Software Foundation|
|ggplot2| A popular library for data visualization in R|
|GPU| Graphics processing unit|
|Git| De facto standard for code asset management, also known as version management or version control. Several services have emerged around Git, such as GitHub and GitLab|
|Hadoop| A Java-based framework that manages data processing and storage for big data applications running in clustered systems|
|Java| Object-oriented programming language|
|Java-ML| A Java library for machine learning|
|JVM| Java Virtual Machine|
|JavaScript| A general-purpose language that extended beyond the browser with the creation of Node.js and other server-side approaches|
|Julia| A language for high-performance numerical analysis and computational science|
|Jupyter Notebook| A browser-based application that allows you to create and share documents containing code, equations, visualizations, narrative text, links, and more|
|Jupyter Lab| A browser-based application that allows you to access multiple Jupyter Notebook files, other code, and data files|
|Kernel| An execution environment for the different programming languages|
|Lattice| A high-level data visualization library that can handle graphics without customizations|
|Library| A collection of functions and methods that allow you to perform many actions without writing the code|
|Leaflet| Used for creating interactive maps|
|ML| Machine learning uses algorithms (also known as “models”) to identify patterns in the data|
|Matplotlib| A Python package for data visualization|
|Model training| The process by which the model learns patterns from data|
|MNIST| Modified National Institute of Standards and Technology|
|MongoDB| A NoSQL database for big data management that was built with C++|
|NLP| Natural Language Processing|
|NLTK| Natural Language Toolkit|
|NumPy| A library based on arrays and matrices, allowing you to apply mathematical functions to the arrays|
|OSI| Open Source Initiative|
|PaaS| Platform as a service|
|Pandas| A library that offers data structures and tools for effective data cleaning, manipulation, and analysis|
|Plotly| Used for web-based data visualizations that can be displayed or saved as individual HTML files|
|PMML| Predictive Model Markup Language|
|Python| A high-level, general-purpose programming language. It has a large standard library that provides tools suited to many different tasks, including databases, automation, web scraping, text processing, image processing, machine learning, and data analytics|
|R| A statistical computing language|
|Regression models| Are used to predict a numeric (or “real”) value|
|Reinforcement Learning|Loosely based on the way human beings and other organisms learn.|
|REST| Representational State Transfer|
|RStudio| Unifies programming, execution, debugging, remote data access, data exploration, and visualization into one tool|
|SaaS| Software as a service|
|Scala| The name combines "scalable" and "language". A general-purpose programming language that supports functional programming and has a strong static type system|
|Spyder| Integrates code, documentation, and visualizations, among others, into a single canvas|
|SQL| Structured Query Language, a non-procedural language used for querying and managing data|
|Supervised Learning| A type of learning in which a human provides input data and the correct outputs|
|TensorFlow| A deep learning library for dataflow programming that was built with C++|
|Unsupervised Learning| The data is not labeled by a human. An example is clustering, in which a model assigns each record of a data set to one of several groups of similar records|
|Watson Studio| A fully integrated development environment for data scientists|
|Weka| A Java-based software suite for data mining|

