How The Internet Works
What is the internet, anyway?
The internet is a global network of devices that implement a protocol known as the Internet Protocol (IP). While it is possible to form small disconnected private networks using IP, only the planet-wide network of ISPs is "the" internet.
Okay, what is IP?
IP defines how packets of data are encoded and transmitted across a network. An IP packet is like an envelope of data; it has a source address, destination address, protocol and payload (among other things).
- Source address: IP address of the device which sent this packet.
- Destination: IP address of the recipient. Your internet provider, the recipient's provider and everything in between are responsible for finding an efficient route for your packet to travel. How they do that is another can of worms.
- Protocol: specifies format of the inner payload envelope, usually TCP, UDP or ICMP (ping).
- Payload: the contents of the envelope.
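To make the envelope idea concrete, here is a minimal sketch that decodes those fields from the fixed 20-byte part of an IPv4 header. The function name `parse_ipv4_header` and the hand-built `sample` packet are assumptions for illustration; the checksum in the sample is left at zero rather than computed, and header options are not decoded.

```python
import socket
import struct

# IANA protocol numbers for the payload types mentioned above.
PROTOCOL_NAMES = {1: "ICMP", 6: "TCP", 17: "UDP"}


def parse_ipv4_header(packet: bytes) -> dict:
    """Decode the fixed 20-byte part of an IPv4 header (options are skipped)."""
    (version_ihl, _tos, total_length, _ident, _flags_frag,
     ttl, protocol, _checksum, src, dst) = struct.unpack(
        "!BBHHHBBH4s4s", packet[:20])
    header_len = (version_ihl & 0x0F) * 4  # IHL is counted in 32-bit words
    return {
        "version": version_ihl >> 4,
        "ttl": ttl,
        "protocol": PROTOCOL_NAMES.get(protocol, str(protocol)),
        "source": socket.inet_ntoa(src),
        "destination": socket.inet_ntoa(dst),
        "payload": packet[header_len:total_length],
    }


# Hypothetical packet from 10.0.0.1 to 10.0.0.2 (checksum not computed).
sample = struct.pack(
    "!BBHHHBBH4s4s",
    0x45,        # version 4, header length 5 words (20 bytes)
    0,           # type of service
    20 + 5,      # total length: header plus 5 payload bytes
    0, 0,        # identification, flags/fragment offset
    64,          # time to live
    6,           # protocol number 6 = TCP
    0,           # checksum placeholder
    socket.inet_aton("10.0.0.1"),
    socket.inet_aton("10.0.0.2"),
) + b"hello"

fields = parse_ipv4_header(sample)
print(fields)
```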
There are two versions of IP in use, IPv4 and IPv6. The main difference is that IPv6 is newer and has longer addresses (128 bits instead of IPv4's 32 bits, since we've run out of IPv4 addresses), but it is not yet supported everywhere.
IP addresses are managed hierarchically, with the IANA (Internet Assigned Numbers Authority) at the top. IANA allocates ranges of IP addresses to various regional authorities such as ARIN (American Registry for Internet Numbers), which then allocates smaller ranges to ISPs in the region. The ISPs then allocate them to businesses and residential customers.
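The hierarchy can be sketched with Python's standard `ipaddress` module. The ranges below are assumptions chosen for illustration (a private 10.0.0.0/8 block standing in for a regional allocation); they are not real IANA or ARIN assignments.

```python
import ipaddress

# Hypothetical block handed to a regional authority.
regional_block = ipaddress.ip_network("10.0.0.0/8")

# The regional authority carves it into smaller ranges for ISPs.
isp_blocks = list(regional_block.subnets(new_prefix=16))
isp_block = isp_blocks[1]

# The ISP carves its range into still smaller ranges for customers.
customer_blocks = list(isp_block.subnets(new_prefix=24))
customer_block = customer_blocks[5]

print(isp_block, "has", isp_block.num_addresses, "addresses")
print(customer_block, "has", customer_block.num_addresses, "addresses")

# Each level sits entirely inside the level above it.
print(customer_block.subnet_of(isp_block), isp_block.subnet_of(regional_block))
```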
Link and physical layer
When data is transmitted across a network, it is actually wrapped up in a series of envelopes, like an onion or matryoshka dolls. The outer layer is the physical layer, which is the term given to the hardware actually transmitting bits of data over wires, fiber optics or radio waves. Inside that is the data-link layer, usually ethernet or Wi-Fi (wireless ethernet). This is managed by a chip on your network interface hardware, which splits the data to be transmitted into "frames". Each frame carries a header and trailer with these fields:
- Source MAC address
- Destination MAC address
- Type (IPv4/IPv6/ARP/RARP)
- CRC (cyclic redundancy check, to verify that the data was uncorrupted in transit)
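As a rough sketch, the first 14 bytes of an ethernet frame can be decoded like this. On the wire the destination MAC comes before the source MAC. The helper names and the sample frame are assumptions for illustration, and the CRC is left out because network hardware typically checks and strips it before software sees the frame.

```python
import struct

# EtherType values for the payload types listed above.
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x8035: "RARP", 0x86DD: "IPv6"}


def format_mac(raw: bytes) -> str:
    """Render 6 raw bytes in the usual colon-separated hex notation."""
    return ":".join(f"{byte:02x}" for byte in raw)


def parse_ethernet_header(frame: bytes) -> dict:
    """Decode destination MAC, source MAC and type from a frame header."""
    dst, src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return {
        "destination": format_mac(dst),
        "source": format_mac(src),
        "type": ETHERTYPES.get(ethertype, hex(ethertype)),
        "payload": frame[14:],
    }


# Hypothetical frame: the source MAC is made up for this example.
sample_frame = (
    bytes.fromhex("406c8f25f1c9")     # destination MAC
    + bytes.fromhex("020000000001")   # source MAC
    + struct.pack("!H", 0x0800)       # type: IPv4
    + b"ip datagram goes here"
)

header = parse_ethernet_header(sample_frame)
print(header)
```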
A MAC (media access control) address is similar to an IP address, but it is used for delivering frames between hardware devices on the same local network. Every ethernet interface has a unique MAC address burned into it, so that each device on a local network can identify itself when exchanging frames with the others.
There is a simple protocol called the Address Resolution Protocol (ARP) which lets machines on a network send a broadcast to find out which machine has a given IP address and map it to its MAC address. Say 10.0.0.1 wants to send a packet to 10.0.0.2:
- 10.0.0.1 broadcasts an ARP packet: "Who is 10.0.0.2? Tell 10.0.0.1"
- 10.0.0.2 sees the broadcast and replies with an ARP reply packet: "I am 10.0.0.2, my MAC address is 40:6c:8f:25:f1:c9"
Simple enough. Now 10.0.0.1 can send packets to 10.0.0.2 via ethernet, because it knows what destination MAC to use.
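The exchange above can be modelled in a few lines. This is an in-memory simulation, not real ARP traffic: the `Host` class, its methods and the MAC addresses of 10.0.0.1 and 10.0.0.3 are assumptions made up for the example.

```python
class Host:
    """A toy machine on a local network with an ARP cache."""

    def __init__(self, ip: str, mac: str):
        self.ip = ip
        self.mac = mac
        self.arp_cache = {}  # maps IP address -> MAC address

    def handle_arp_request(self, target_ip: str):
        """Answer only if the request asks for our own IP address."""
        return self.mac if target_ip == self.ip else None

    def resolve(self, target_ip: str, network: list):
        """Return the MAC for target_ip, broadcasting if it is not cached."""
        if target_ip in self.arp_cache:
            return self.arp_cache[target_ip]
        for host in network:  # a broadcast reaches every host on the network
            if host is self:
                continue
            mac = host.handle_arp_request(target_ip)
            if mac is not None:
                self.arp_cache[target_ip] = mac  # remember the answer
                return mac
        return None  # nobody on the local network owns that address


alice = Host("10.0.0.1", "02:00:00:00:00:01")
bob = Host("10.0.0.2", "40:6c:8f:25:f1:c9")
carol = Host("10.0.0.3", "02:00:00:00:00:03")
network = [alice, bob, carol]

print(alice.resolve("10.0.0.2", network))
```

A second call to `alice.resolve("10.0.0.2", network)` is answered from `arp_cache` without another broadcast.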
So when we send a packet of data, it is usually an IP datagram inside of an ethernet frame. What's inside the IP datagram? Probably TCP or UDP.
TCP stands for transmission control protocol, and it describes how two computers can set up a "connection" between them, on a specific port number and reliably send and receive data. The port corresponds to an application protocol, allowing multiple applications to share the same IP address for different services.
For example, a server on the internet may be listening for TCP connections on port 80 (HTTP, the world wide web), 443 (HTTPS), 25 (email), 22 (SSH), and many more besides.
When you load up google.com in your web browser this is what your computer does:
- Locates the MAC address of its internet gateway with ARP (this step is usually skipped because the ARP response is cached for some time)
- Queries its DNS server for the IP address of google.com (18.104.22.168)
- Attempts to open a TCP connection by sending a TCP packet with the SYN flag set to port 80 on that IP address. The IP destination header is set to google.com's IP address, but the ethernet destination is set to your internet gateway's MAC address
- Your internet gateway (your DSL modem, cable modem, etc) receives the datagram, strips off the ethernet frame header, encapsulates it in its own link-layer envelope and shoots it off to your ISP
- Your ISP routes the packet on to its destination using a complex system of routing built on BGP (the Border Gateway Protocol), which we're not going to talk about right now
- The google.com server responds with a SYN-ACK, letting your operating system know that it can continue with the connection
- You then reply with an ACK, and now a TCP connection exists between you and google.com on port 80
- At this point an application-level protocol (HTTP) is used to reliably transfer the contents of the web page via TCP
- The connection is then closed
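From a program's point of view, most of the steps above are handled by the operating system. The sketch below shows roughly what the application side looks like in Python. To keep it runnable without internet access, a tiny stand-in server on the loopback address plays the part of google.com; `serve_once`, `fetch` and the canned response are assumptions for illustration, not how a real web server behaves.

```python
import socket
import threading


def serve_once(server_sock: socket.socket) -> None:
    """Accept one connection, read the request, send a fixed HTTP response."""
    conn, _addr = server_sock.accept()
    with conn:
        conn.recv(1024)  # read (and ignore) the request
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nhello")


def fetch(host: str, port: int) -> bytes:
    """Open a TCP connection, send an HTTP request, read until the peer closes."""
    # create_connection resolves the name and asks the OS to connect;
    # the SYN / SYN-ACK / ACK handshake happens inside this call.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # empty read: the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


# Stand-in server on an OS-chosen free port of the loopback interface.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
server_port = server.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server,), daemon=True)
worker.start()

response = fetch("127.0.0.1", server_port)
worker.join(timeout=5)
server.close()

print(response.decode())
```

Calling `fetch` with a real hostname and port 80 would follow the same path, with the name lookup and routing steps happening underneath.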
Aside from letting you create persistent connections between machines, TCP also does its best to guarantee delivery of packets. If a suitable reply from the destination is not received after a packet is sent, the packet is automatically retransmitted.
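The retransmission idea can be illustrated with a heavily simplified model: send one segment, wait for its acknowledgement, and resend if none arrives. Real TCP is far more elaborate (sliding windows, cumulative acknowledgements, adaptive timers). The function `send_reliably` and its `drop_plan` parameter are assumptions invented for this sketch.

```python
def send_reliably(segments, drop_plan, max_attempts=5):
    """Deliver segments in order over a lossy channel, resending lost ones.

    drop_plan maps a segment index to the number of transmissions of that
    segment the simulated network will lose before one gets through.
    """
    delivered = []
    attempts_used = []
    for index, segment in enumerate(segments):
        losses_left = drop_plan.get(index, 0)
        for attempt in range(1, max_attempts + 1):
            if losses_left > 0:
                # Segment lost: no acknowledgement arrives, the timer
                # expires and the sender tries again.
                losses_left -= 1
                continue
            delivered.append(segment)  # acknowledged by the receiver
            attempts_used.append(attempt)
            break
        else:
            raise TimeoutError(f"segment {index} was never acknowledged")
    return delivered, attempts_used


segments = [b"GET ", b"/ HT", b"TP"]
delivered, attempts_used = send_reliably(segments, drop_plan={1: 2})
print(b"".join(delivered), attempts_used)
```

Here the second segment is lost twice and gets through on its third transmission, yet the receiver still ends up with every segment in order.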
UDP is similar in that it allows datagrams to be sent to a particular port, but it does not have the overhead of connection handshaking or retransmission, which makes it more efficient but less reliable. It is mostly used for real-time applications where data loss is not a major concern, such as IP telephony, audio or video streaming, and video games.
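For comparison, here is a minimal UDP exchange over the loopback interface. There is no handshake and no acknowledgement: the sender simply fires a datagram at an address and port. The payload and variable names are assumptions for illustration.

```python
import socket

# Receiver: bind to an OS-chosen free UDP port on the loopback interface.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2)  # give up instead of blocking forever
udp_port = receiver.getsockname()[1]

# Sender: no connection setup, just send the datagram.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"voice frame 1", ("127.0.0.1", udp_port))

datagram, sender_address = receiver.recvfrom(1024)
print(datagram, "from", sender_address)

sender.close()
receiver.close()
```

On a real network the datagram could be lost without either side being told; any recovery is left to the application.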
The domain name system is used to translate domain names into IP addresses. See this helpful illustration for more details.
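Under the hood, a lookup is a small binary message, usually sent over UDP to port 53 of a DNS server. The sketch below builds such a query for an IPv4 address (an "A" record) without sending it. The function name `build_dns_query` and the query ID are assumptions for illustration.

```python
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a DNS query message asking for the A record of hostname."""
    # Header: ID, flags (0x0100 = recursion desired), one question,
    # zero answer, authority and additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels ending in a zero byte.
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    # Question tail: type 1 (A record), class 1 (internet).
    return header + qname + struct.pack("!HH", 1, 1)


query = build_dns_query("google.com")
print(len(query), query.hex())
```

In everyday code this is rarely done by hand; `socket.getaddrinfo("google.com", 80)` asks the operating system's resolver to perform the whole lookup.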
The formal name for this scheme of nested envelopes encapsulating data packets is the Open Systems Interconnection (OSI) model.